{"title": "Learning from Small Sample Sets by Combining Unsupervised Meta-Training with CNNs", "book": "Advances in Neural Information Processing Systems", "page_first": 244, "page_last": 252, "abstract": "This work explores CNNs for the recognition of novel categories from few examples. Inspired by the transferability properties of CNNs, we introduce an additional unsupervised meta-training stage that exposes multiple top layer units to a large amount of unlabeled real-world images. By encouraging these units to learn diverse sets of low-density separators across the unlabeled data, we capture a more generic, richer description of the visual world, which decouples these units from ties to a specific set of categories. We propose an unsupervised margin maximization that jointly estimates compact high-density regions and infers low-density separators. The low-density separator (LDS) modules can be plugged into any or all of the top layers of a standard CNN architecture. The resulting CNNs significantly improve the performance in scene classification, fine-grained recognition, and action recognition with small training samples.", "full_text": "Learning from Small Sample Sets by Combining\n\nUnsupervised Meta-Training with CNNs\n\nYu-Xiong Wang\n\nMartial Hebert\n\nRobotics Institute, Carnegie Mellon University\n\n{yuxiongw, hebert}@cs.cmu.edu\n\nAbstract\n\nThis work explores CNNs for the recognition of novel categories from few exam-\nples. Inspired by the transferability properties of CNNs, we introduce an additional\nunsupervised meta-training stage that exposes multiple top layer units to a large\namount of unlabeled real-world images. By encouraging these units to learn diverse\nsets of low-density separators across the unlabeled data, we capture a more generic,\nricher description of the visual world, which decouples these units from ties to a\nspeci\ufb01c set of categories. 
We propose an unsupervised margin maximization that\njointly estimates compact high-density regions and infers low-density separators.\nThe low-density separator (LDS) modules can be plugged into any or all of the\ntop layers of a standard CNN architecture. The resulting CNNs signi\ufb01cantly im-\nprove the performance in scene classi\ufb01cation, \ufb01ne-grained recognition, and action\nrecognition with small training samples.\n\n1 Motivation\n\nTo successfully learn a deep convolutional neural network (CNN) model, hundreds of millions of\nparameters need to be inferred from millions of labeled examples on thousands of image categories [1,\n2, 3]. In practice, however, for novel categories/tasks of interest, collecting a large corpus of annotated\ndata to train CNNs from scratch is typically unrealistic, such as in robotics applications [4] and for\ncustomized categories [5]. Fortunately, although trained on particular categories, CNNs exhibit certain\nattractive transferability properties [6, 7]. This suggests that they could serve as universal feature\nextractors for novel categories, either as off-the-shelf features or through \ufb01ne-tuning [7, 8, 9, 10].\nSuch transferability is promising but still restrictive, especially for novel-category recognition from\nfew examples [11, 12, 13, 14, 15, 16, 17, 18]. The overall generality of CNNs is negatively affected\nby the specialization of top layer units to their original task. Recent analysis shows that from bottom,\nmiddle, to top layers of the network, features make a transition from general to speci\ufb01c [6, 8]. While\nfeatures in the bottom and middle layers are fairly generic to many categories (i.e., low-level features\nof Gabor \ufb01lters or color blobs and mid-level features of object parts), high-level features in the top\nlayers eventually become speci\ufb01c and biased to best discriminate between a particular set of chosen\ncategories. 
With limited samples from target tasks, \ufb01ne-tuning cannot effectively adjust the units and\nwould result in over-\ufb01tting, since it typically requires a signi\ufb01cant amount of labeled data. Using\noff-the-shelf CNNs becomes the best strategy, despite the specialization and reduced performance.\nIn this work we investigate how to improve pre-trained CNNs for the learning from few examples.\nOur key insight is to expose multiple top layer units to a massive set of unlabeled images, as shown\nin Figure 1, which decouples these units from ties to the original speci\ufb01c set of categories. This\nadditional stage is called unsupervised meta-training to distinguish this phase from the conventional\nunsupervised pre-training phase [19] and the training phase on the target tasks. Based on the above\ntransferability analysis, intuitively, bottom and middle layers construct a feature space with high-\ndensity regions corresponding to potential latent categories. Top layer units in the pre-trained CNN,\nhowever, only have access to those regions associated with the original, observed categories. The\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: We aim to improve the transferability of pre-trained CNNs for the recognition of novel\ncategories from few labeled examples. 
We perform a multi-stage training procedure: 1) We \ufb01rst\npre-train a CNN that recognizes a speci\ufb01c set of categories on a large-scale labeled dataset (e.g.,\nImageNet 1.2M), which provides fairly generic bottom and middle layer units; 2) We then meta-train\nthe top layers as low-density separators on a far larger set of unlabeled data (e.g., Flickr 100M), which\nfurther improves the generality of multiple top layer units; 3) Finally, we use our modi\ufb01ed CNN\non new categories/tasks (e.g., scene classi\ufb01cation, \ufb01ne-grained recognition, and action recognition),\neither as off-the-shelf features or as initialization of \ufb01ne-tuning that allows for end-to-end training.\nunits are then tuned to discriminate between these regions by separating the regions while pushing\nthem further away from each other. To tackle this limitation, our unsupervised meta-training provides\na far larger pool of unlabeled images as a much less biased sampling in the feature space. Now,\ninstead of producing separations tied to the original categories, we generate diverse sets of separations\nacross the unlabeled data. Since the unit \u201ctries to discriminate the data manifold from its surroundings,\nin all non-manifold directions\u201d1, we capture a more generic and richer description of the visual world.\nHow can we generate these separations in an unsupervised manner? Inspired by the structure/manifold\nassumption in shallow semi-supervised and unsupervised learning (i.e., the decision boundary should\nnot cross high-density regions, but instead lie in low-density regions) [20, 21], we introduce a low-\ndensity separator (LDS) module that can be plugged into any (or all) top layers of a standard CNN\narchitecture. More precisely, the vector of weights connecting a unit to its previous layer (together\nwith the non-linearity) can be viewed as a separator or decision boundary in the activation space of\nthe previous layer. 
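This geometric view of a fully connected unit can be made concrete in a few lines (a toy numpy sketch; the sizes and the ReLU non-linearity are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                        # toy dimensionality of layer k-1 activations
x = np.append(rng.standard_normal(D), 1.0)   # constant 1 appended so w can carry the bias
w = rng.standard_normal(D + 1)               # weight vector of one unit s (bias included)

side = np.sign(w @ x)                        # which side of the hyperplane w^T x = 0 the point falls on
activation = max(0.0, w @ x)                 # the unit's output with f = ReLU
```

The weight vector thus doubles as a hyperplane whose sign pattern partitions the previous layer's activation space.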
LDS then generates connection weights (decision boundaries) between successive\nlayers that traverse regions of as low density as possible and avoid intersecting high-density regions in\nthe activation space. Many LDS methods typically infer a probability distribution, for example through\ndensest region detection, lowest-density hyperplane estimation [21], and clustering [22]. However,\nexact clustering or density estimation is known to be notoriously dif\ufb01cult in high-dimensional spaces.\nWe instead adopt a discriminative paradigm [20, 23, 24, 14] to circumvent the aforementioned\ndif\ufb01culties. Using a max-margin framework, we propose an unsupervised, scalable, coarse-to-\ufb01ne\napproach that jointly estimates compact, distinct high-density quasi-classes (HDQC), i.e., sets of data\npoints sampled in high-density regions, as stand-ins for plausible high-density regions and infers low-\ndensity hyperplanes (separators). Our decoupled formulations generalize those in supervised binary\ncode discovery [23] and semi-supervised learning [24], respectively; and more crucially, we propose\na novel combined optimization to jointly estimate HDQC and learn LDS in large-scale unsupervised\nscenarios, from the labeled ImageNet 1.2M [25] to the unlabeled Flickr 100M dataset [26].\nOur approach of exploiting unsupervised learning on top of CNN transfer learning is unique as op-\nposed to other recent work on unsupervised, weakly-supervised, and semi-supervised deep learning.\nMost existing unsupervised deep learning approaches focus on unsupervised learning of visual repre-\nsentations that are both sparse and allow image reconstruction [19], including deep belief networks\n(DBN), convolutional sparse coding, and (denoising) auto-encoders (DAE). 
Our unsupervised LDS meta-training is different from conventional unsupervised pre-training as in DBN and DAE in two important ways: 1) our meta-training “post-arranges” the network that has undergone supervised training on a labeled dataset and then serves as a kind of network “pre-conditioner” [19] for the target tasks; and 2) our meta-training phase is not necessarily followed by fine-tuning, and the features obtained by meta-training could be used off the shelf.

Other types of supervisory information (by creating auxiliary tasks), such as clustering, surrogate classes [27, 4], spatial context, temporal consistency, web supervision, and image captions [28], have been explored to train CNNs in an unsupervised (or weakly-supervised) manner.

1 Yoshua Bengio. https://disqus.com/by/yoshuabengio/

Although showing initial promise, the performance of these unsupervised (or weakly-supervised) deep models is still not on par with that of their supervised counterparts, partially due to noisy or biased external information [28]. In addition, our LDS, if viewed as an auxiliary task, is directly related to discriminative classification, which results in more desirable and consistent features for the final novel-category recognition tasks. Unlike using a single image and its pre-defined transformations [27] or other labeled multi-view objects [4] to simulate a surrogate class, our quasi-classes capture a more natural representation of realistic images. 
Finally, while we boost the overall generality of CNNs for a wide\nspectrum of unseen categories, semi-supervised deep learning approaches typically improve the\nmodel generalization for speci\ufb01c tasks, with both labeled and unlabeled data coming from the tasks\nof interest [29, 30].\nOur contribution is three-fold: First, we show how LDS, based on an unsupervised margin maxi-\nmization, is generated without a bias to a particular set of categories (Section 2). Second, we detail\nhow to use LDS modules in CNNs by plugging them into any (or all) top layers of the architecture,\nleading to single-scale (or multi-scale) low-density separator networks (Section 3). Finally, we show\nhow such modi\ufb01ed CNNs, with enhanced generality, are used to facilitate the recognition of novel\ncategories from few examples and signi\ufb01cantly improve the performance in scene classi\ufb01cation,\n\ufb01ne-grained recognition, and action recognition (Section 4). The general setup is depicted in Figure 1.\n2 Pre-trained low-density separators from unsupervised data\n\nGiven a CNN architecture pre-trained on a speci\ufb01c set of categories, such as the ImageNet (ILSVRC)\n1,000 categories, we aim to improve the generality of one of its top layers, e.g., the k-th layer. We\n\ufb01x the structures and weights of the layers from 1 to k\u22121, and view the activation of layer k\u22121 as\na feature space. A unit s in layer k is fully connected to all the units in layer k\u22121 via a vector of\nweights ws. 
Each ws corresponds to a particular decision boundary (partition) of the feature space. Intuitively, all the ws's then jointly further discriminate between these 1,000 categories, enforcing that the new activations in layer k are more similar within classes and more dissimilar between classes.

To make the ws's and the associated units in layer k unspecific to the ImageNet 1,000 categories, we use a large amount of unlabeled images at the unsupervised meta-training stage. The layers from 1 to k−1 remain unchanged, which means that we still tackle the same feature space. The new unlabeled images now constitute a less biased sampling of the feature space in layer k−1. We introduce a new k-th layer with more units and encourage their unbiased exploration of the feature space. More precisely, we enforce that the units learn many diverse decision boundaries ws's that traverse different low-density regions while avoiding intersecting high-density regions of the unsupervised data (untied to the original ImageNet categories). The set of possible arrangements of such decision boundaries is rich, meaning that we can potentially generalize to a broad range of categories.

2.1 Approach overview

We denote column vectors and matrices with italic bold letters. For each unlabeled image Ii, where i ∈ {1, 2, . . . , N}, let xi ∈ R^D and φi ∈ R^S be the vectorized activations in layers k−1 and k, respectively. Let W be the weights between the two layers, where ws is the weight vector associated with the unit s in layer k. For notational simplicity, xi already includes a constant 1 as the last element and ws includes the bias term. We then have φ_i^s = f(ws^T xi), where f(·) is a non-linear function, such as sigmoid or ReLU. The resulting activation spaces of layers k−1 and k are denoted as X and F, respectively.

To learn the ws's as low-density separators, we are supposed to have certain high-density regions which the ws's separate. However, accurate estimation of high-density regions is difficult. We instead generate quasi-classes as stand-ins for plausible high-density regions. We want samples with the same quasi-labels to be similar in activation spaces (constraints within quasi-classes), while those with different quasi-labels should be very dissimilar in activation spaces (constraints between quasi-classes). Note that in contrast to clustering, generating quasi-classes does not require inferring membership for each data point. Formally, assuming that there are C desired quasi-classes, we introduce a sample selection vector Tc ∈ {0, 1}^N for each quasi-class c. Tc,i = 1 if Ii is selected for assignment to quasi-class c and zero otherwise. As illustrated in Figure 4, the optimization for seeking low-density separators (LDS) while identifying high-density quasi-classes (HDQC) can be framed as

find W ∈ LDS, T ∈ HDQC, subject to W separates T. (1)

This optimization problem enforces that each unit s learns a partition ws lying across the low-density region among certain salient high-density quasi-classes discovered by T. This leads to a difficult joint optimization problem in theory, because W and T are interdependent.

In practice, however, it may be unnecessary to find the global optimum. Reasonable local optima are sufficient in our case to describe the feature space, as shown by the empirical results in Section 4. We use an iterative approach that obtains salient high-density quasi-classes from coarse to fine (Section 2.3) and produces promising discriminative low-density partitions among them (Section 2.2). We found that the optimization procedures converge in our experiments.

2.2 Learning low-density separators

Assume that T is known, which means that we have already defined C high-density quasi-classes by Tc. 
We then use a max-margin formulation to learn W. Each unit s in layer k corresponds to a low-density hyperplane ws that separates positive and negative examples in a max-margin fashion. To train ws, we need to generate label variables l^s ∈ {−1, 1} for each ws, which label the samples in the quasi-classes either as positive (1) or negative (−1) training examples. We can stack all the labels for learning the ws's to form L = [l^1, . . . , l^S]. Moreover, in the activation space F of layer k, which is induced by the activation space X of layer k−1 and the ws's, it would be beneficial to further push for large inter-quasi-class and small intra-quasi-class distances. We achieve such properties by optimizing

\min_{W,L,\Phi} \sum_{s=1}^{S} \|w_s\|^2 + \eta \sum_{s=1}^{S} \sum_{i=1}^{N} I_i \big[ 1 - l_i^s \, w_s^T x_i \big]_+ + \frac{\lambda_1}{2} \sum_{c=1}^{C} \sum_{u,v=1}^{N} T_{c,u} T_{c,v} \, d(\phi_u, \phi_v) - \frac{\lambda_2}{2} \sum_{c'=1}^{C} \sum_{\substack{c''=1 \\ c'' \neq c'}}^{C} \sum_{p,q=1}^{N} T_{c',p} T_{c'',q} \, d(\phi_p, \phi_q), \quad (2)

where d is a distance metric (e.g., square of Euclidean distance) in the activation space F of layer k, and [x]_+ = max(0, x) represents the hinge loss. Here we introduce an additional indicator vector I ∈ {0, 1}^N for all the quasi-classes. Ii = 0 if Ii is not selected for assignment to any quasi-class (i.e., \sum_{c=1}^{C} T_{c,i} = 0) and one otherwise. Note that I is actually sparse, since only a portion of the unlabeled samples are selected as quasi-classes and only their memberships are estimated in T.

The new objective is much easier to optimize compared to Eqn. (1), as it only requires producing the low-density separators ws from known quasi-classes given Tc. 
We then derive an algorithm to optimize problem (2) using block coordinate descent. Specifically, problem (2) can be viewed as a generalization of predictable discriminative binary codes in [23]: 1) compared with the fully labeled case in [23], Eqn. (2) introduces additional quasi-class indicator variables to handle the unsupervised scenario; 2) Eqn. (2) extends the specific binary-valued hash functions in [23] to general real-valued non-linear activation functions in neural networks.

We adopt a similar iterative optimization strategy as in [23]. To achieve a good local minimum, our insight is that there should be diversity in the ws's, and we thus initialize the ws's as the top-S orthogonal directions of PCA on data points belonging to the quasi-classes. We found that this initialization yields promising results that work better than random initialization and do not contaminate the pre-trained CNNs. For fixed W, we update Φ using stochastic gradient descent to achieve improved separation in the activation space F of layer k. This optimization is efficient if using ReLU as the non-linearity. We then use Φ to update L: l_i^s = 1 if φ_i^s > 0 and −1 otherwise. Using L as training labels, we then train S linear SVMs to update W. We iterate this process a fixed number of times (2 to 4 in practice), and we thus obtain the low-density separator ws for each unit and construct the activation space F of layer k.

2.3 Generating high-density quasi-classes

In the previous section, we assumed T known and learned low-density separators between high-density quasi-classes. Now we explain how to find these quasi-classes. 
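Before doing so, the alternating optimization of Section 2.2 can be summarized in a toy sketch (illustrative only: a few-line hinge-loss subgradient solver stands in for the linear SVMs, the inter-/intra-quasi-class distance terms of Eqn. (2) are omitted, and all sizes are toy values):

```python
import numpy as np

def fit_linear_svm(X, y, lam=1e-2, epochs=100, lr=0.1):
    """Tiny full-batch subgradient descent on the regularized hinge loss,
    standing in for a real linear SVM solver."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        viol = y * (X @ w) < 1                    # margin violations
        grad = lam * w - (X[viol] * y[viol][:, None]).sum(axis=0) / n
        w -= lr * grad
    return w

def learn_lds(X_qc, S, n_iters=3):
    """Toy version of the Section 2.2 alternation on quasi-class samples (M x D)."""
    # Initialize the w_s's as the top-S orthogonal PCA directions of the data.
    Xc = X_qc - X_qc.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:S].copy()
    for _ in range(n_iters):
        Phi = np.maximum(0.0, X_qc @ W.T)         # Phi update (ReLU activations)
        L = np.where(Phi > 0, 1, -1)              # label update: l_i^s = 1 iff phi_i^s > 0
        for s in range(S):
            if len(np.unique(L[:, s])) < 2:       # need both labels to fit a separator
                continue
            W[s] = fit_linear_svm(X_qc, L[:, s].astype(float))
    return W

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((200, 16))
W = learn_lds(X_toy, S=4)
```

The PCA initialization and the Phi/L/W update order follow the text; everything else here is a deliberately simplified stand-in.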
Given the activation space X of layer k−1 and the activation space F of layer k (linked by the low-density separators W as weights), we need to generate C high-density quasi-classes from the unlabeled data selected by Tc. We hope that the quasi-classes are distinct and compact in the activation spaces. That is, we want samples belonging to the same quasi-classes to be close to each other in the activation spaces, while samples from different quasi-classes should be far from each other in the activation spaces. To this end, we propose a coarse-to-fine procedure that combines the seeding heuristics of K-means++ [31] and a max-margin formulation [24] to gradually augment confident samples into the quasi-classes. We suppose that each quasi-class contains at least τ0 images and at most τ images. Learning T includes the following steps:

Skeleton Generation. We first choose a single seed point T_{c,i_c} = 1 for each quasi-class using the K-means++ heuristics in the activation space X of layer k−1. All the seed points are now spread out as the skeleton of the quasi-classes.

Figure 2: We use our LDS to revisit CNN architectures. In Figure 2a, we embed LDS learned from a large collection of unlabeled data as a new top layer into a standard CNN structure pre-trained on a specific set of categories (left), leading to single-scale LDS+CNN (middle). LDS could also be embedded into different layers, resulting in multi-scale LDS+CNN (right). More specifically, in Figure 2b, our multi-scale LDS+CNN architecture is constructed by introducing LDS layers into multi-scale DAG-CNN [10]. For each scale (level), we spatially (average) pool activations, learn and plug in LDS in this activation space, add fully-connected layers FCa and FCb (with K outputs), and finally add the scores across all layers as predictions for K output classes (that are finally soft-maxed together) on the target task. 
We show that the resulting LDS+CNNs can be either used as off-the-shelf features or discriminatively trained in an end-to-end fashion to facilitate novel category recognition.

Quasi-Class Initialization. We extend each single skeletal point to an initial quasi-class by adding its nearest neighbors [31] in the activation space X of layer k−1. Each of the resulting quasi-classes thus contains τ0 images, which satisfies the constraint on the minimum number of selected samples.

Augmentation and Refinement. In the above two steps, we select samples for quasi-classes based on the similarity in the activation space of layer k−1. Given this initial estimate of quasi-classes, we select additional samples using joint similarity in both activation spaces of layers k−1 and k by leveraging a max-margin formulation. For each quasi-class c, we construct quasi-class classifiers h_c^X and h_c^F in the two activation spaces. Note that h_c^X and h_c^F are different from the low-density separators ws. We use SVM responses to select additional samples, leading to the following optimization:

\min_{T, h_c^X, h_c^F} \alpha \sum_{c=1}^{C} \Big( \|h_c^X\|_2^2 + \lambda_X \sum_{i=1}^{N} I_i \big[ 1 - y_{c,i} \, h_c^{X\,T} x_i \big]_+ \Big) + \sum_{j=1}^{N} \sum_{c'=1}^{C} \sum_{\substack{c''=1 \\ c'' \neq c'}}^{C} T_{c',j} T_{c'',j} + \beta \sum_{c=1}^{C} \Big( \|h_c^F\|_2^2 + \lambda_F \sum_{i=1}^{N} I_i \big[ 1 - y_{c,i} \, h_c^{F\,T} \phi_i \big]_+ - \sum_{j=1}^{N} T_{c,j} \, h_c^{F\,T} \phi_j \Big)
\text{s.t.} \quad \tau_0 \le \sum_{i=1}^{N} T_{c,i} \le \tau, \ \forall c \in \{1, \ldots, C\}, \quad (3)

where yc,i is the corresponding binary label used for one-vs.-all multi-quasi-class classification: yc,i = 1 if Tc,i = 1 and −1 otherwise. The first and second terms denote a max-margin classifier in the activation space X, and the fourth and fifth terms denote a max-margin classifier in the activation space F. The third term ensures that the same unlabeled sample is not shared by multiple quasi-classes. The last term is a sample selection criterion that chooses those unlabeled samples with high classifier responses in the activation space F.

This formulation is inspired by the approach to selecting unlabeled images using joint visual features and attributes [24]. We view our activation space X of layer k−1 as the feature space, and the activation space F of layer k as the learned attribute space. However, different from the semi-supervised scenario in [24], which provides an initial set of labeled training images, our problem (3) is entirely unsupervised. To solve it, we use the initial T corresponding to the quasi-classes obtained in the first two steps to train h_c^X and h_c^F. After obtaining these two sets of SVMs in both activation spaces, we update T. 
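The three steps above can be sketched end-to-end in a toy form (illustrative only: the K-means++ seeding and nearest-neighbor initialization follow the text, while a crude centroid-based joint scorer stands in for the max-margin classifiers h_c^X and h_c^F of problem (3); the no-sharing constraint across quasi-classes is omitted, and all sizes are toy values):

```python
import numpy as np

def kmeanspp_seeds(X, C, rng):
    """Skeleton generation: pick C spread-out seed indices, K-means++ style [31]."""
    seeds = [int(rng.integers(len(X)))]              # first seed uniformly at random
    for _ in range(C - 1):
        # squared distance of every point to its nearest chosen seed
        d2 = np.min(((X[:, None, :] - X[seeds]) ** 2).sum(-1), axis=1)
        seeds.append(int(rng.choice(len(X), p=d2 / d2.sum())))
    return seeds

def grow_quasi_class(X, Phi, members, n_add):
    """Greedy stand-in for augmentation: score unselected samples jointly in the
    X and F activation spaces with centroid scorers (not the SVMs of problem (3))
    and add the top responses."""
    scores = X @ X[members].mean(0) + Phi @ Phi[members].mean(0)
    scores[members] = -np.inf                        # never re-select current members
    return np.concatenate([members, np.argsort(scores)[-n_add:]])

rng = np.random.default_rng(0)
N, D, S = 400, 16, 8
X = rng.standard_normal((N, D))                      # layer k-1 activations
Phi = np.maximum(0.0, X @ rng.standard_normal((D, S)))  # layer k activations

tau0, tau, C = 6, 16, 5
quasi = []
for seed in kmeanspp_seeds(X, C, rng):
    # Initialization: the seed plus its tau0-1 nearest neighbors in X-space.
    nn = np.argsort(((X - X[seed]) ** 2).sum(1))[:tau0]
    # Augmentation: grow each quasi-class up to tau samples.
    quasi.append(grow_quasi_class(X, Phi, nn, tau - tau0))
```

The real procedure alternates SVM re-training with the T update under the τ0/τ constraints; this sketch only conveys the coarse-to-fine growth of each quasi-class.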
Following a similar block coordinate descent procedure as in [24], we iteratively re-train both h_c^X and h_c^F and update T until we obtain the desired τ number of samples.

3 Low-density separator networks

3.1 Single-scale layer-wise training

We start from how to embed our LDS as a new top layer into a standard CNN structure, leading to a single-scale network. To improve the generality of the learned units in layer k, we need to prevent co-adaptation and enforce diversity between these units [6, 19]. We adopt a simple random sampling strategy to train the entire LDS layer. We break the units in layer k into (disjoint) blocks, as shown in Figure 4. We encourage each block of units to explore different regions of the activation space described by a random subset of unlabeled samples. This sampling strategy also makes LDS learning scalable, since direct LDS learning from the entire dataset is computationally infeasible.

Specifically, from an original selection matrix T0 ∈ {0, 1}^{N×C} of all zeros, we first obtain a random sub-matrix T ∈ {0, 1}^{M×C}. Using this subset of M samples, we then generate C high-density quasi-classes by solving problem (3) and learn S corresponding low-density separator weights by solving problem (2), yielding a block of S units in layer k. 
We randomly produce J sub-matrices T, repeat the procedure, and obtain S×J units (J blocks) in total. This thus constitutes layer k, the low-density separator layer. The entire single-scale structure is shown in Figure 2a.

3.2 Multi-scale structure

For a convolutional layer of size H1×H2×F, where H1 is the height, H2 is the width, and F is the number of filter channels, we first compute a 1×1×F pooled feature by averaging across spatial dimensions as in [10], and then learn LDS in this activation space as before. Note that our approach applies to other types of pooling operation as well. Given the benefit of complementary features, LDS could also be operationalized on several different layers, leading to multi-scale/level representations. We thus modify the multi-scale DAG-CNN architecture [10] by introducing LDS on top of the ReLU layers, leading to multi-scale LDS+CNN, as shown in Figure 2b. We add two additional layers on top of LDS: FCa (with F outputs) that selects discriminative units for target tasks, and FCb (with K outputs) that learns a K-way classifier for target tasks. The output of the LDS layers could be used as off-the-shelf multi-scale features. If using LDS weights as initialization, the entire structure in Figure 2b could also be fine-tuned in a similar fashion as DAG-CNN [10].

4 Experimental evaluation

In this section, we explore the use of low-density separator networks (LDS+CNNs) on a number of supervised learning tasks with limited data, including scene classification, fine-grained recognition, and action recognition. We use two powerful CNN models, AlexNet [1] and VGG19 [3] pre-trained on ILSVRC 2012 [25], as our reference networks. We implement the unsupervised meta-training on the Yahoo! Flickr Creative Commons 100M dataset (YFCC100M) [26], which is the largest single publicly available image and video database. 
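Concretely, the block-wise construction of the LDS layer described in Section 3.1 (and instantiated at scale in the implementation details below) can be sketched as follows. Here learn_block is a hypothetical stand-in for solving problems (3) and (2) on one random subset, and all sizes are toy values rather than the paper's settings:

```python
import numpy as np

def learn_block(X_subset, S, rng):
    """Hypothetical stand-in for one block: in the real pipeline this generates
    quasi-classes on the subset (problem (3)) and learns S low-density
    separators (problem (2)). Here it returns S random unit-norm directions."""
    W = rng.standard_normal((S, X_subset.shape[1]))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, D, S, M, J = 5000, 32, 10, 500, 8      # toy stand-ins for the paper's sizes
X = rng.standard_normal((N, D))           # layer k-1 activations of unlabeled data

blocks = []
for _ in range(J):                        # each block sees its own random subset
    idx = rng.choice(N, size=M, replace=False)
    blocks.append(learn_block(X[idx], S, rng))
W_layer = np.vstack(blocks)               # S*J units together form the LDS layer
```

Because each block only ever touches its own M-sample subset, the J blocks can be learned independently and in parallel, which is what makes the 100M-image meta-training tractable.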
We begin by plugging LDS into a single layer, and then introduce LDS into several top layers, leading to a multi-scale model. We consider using LDS+CNNs as off-the-shelf features in the small sample size regime, as well as through fine-tuning when enough data is available in the target task.

Implementation Details. During unsupervised meta-training, we use 99.2 million unlabeled images from YFCC100M [26]. After resizing the smallest side of each image to be 256, we generate the standard 10 crops (4 corners plus one center, and their flips) of size 224×224 as implemented in Caffe [32]. For single-scale structures, we learn LDS in the fc7 activation space of dimension 4,096. For multi-scale structures, following [10], we learn LDS in the activation spaces of Conv3, Conv4, Conv5, fc6, and fc7 for AlexNet, and in the activation spaces of Conv4_3, Conv4_4, Conv5_1, Conv5_2, and fc6 for VGG19. We use the same sets of parameters to learn LDS in these activation spaces without further tuning. In the LDS layer, each block has S = 10 units, which separate across M = 20,000 randomly sub-sampled data points. Repeating the sub-sampling J = 2,000 times, we then have 20,000 units in total. Notably, each block of units in the LDS layer can be learned independently, which makes parallelization feasible. For learning LDS in Eqn. (2), η and λ1 are set to 1, and λ2 is set to normalize for the size of the quasi-classes, which is the same setup and default parameters as in [23]. For generating high-density quasi-classes in Eqn. (3), following [31, 24], we set the minimum and maximum numbers of selected samples per quasi-class to be τ0 = 6 and τ = 56, and produce C = 30 quasi-classes in total. We use the same setup and parameters as in [24], where α = 1, β = 1. While using only the center crops to infer quasi-classes, we use all 10 crops to learn more accurate LDS.

Tasks and Datasets. 
We evaluate on standard benchmark datasets for scene classification (SUN-397 [33] and MIT-67 [34]), fine-grained recognition (Oxford 102 Flowers [35]), and action recognition, i.e., compositional semantic recognition (Stanford-40 actions [36]). These datasets are widely used for evaluating CNN transferability [8], and were chosen for their diversity and coverage of novel categories. We follow the standard experimental setup (e.g., the train/test splits) for these datasets.

4.1 Learning from few examples

The first question to answer is whether the LDS layers improve the transferability of the original pre-trained CNNs and facilitate the recognition of novel categories from few examples. To answer this question, we evaluate both LDS+CNN and CNN as off-the-shelf features without fine-tuning on the target datasets. This is the standard way to use pre-trained CNNs [7]. We test how performance varies with the number of training samples per category as in [16]. To compare with the state of the art, we use VGG19 in this set of experiments. Following standard practice, we train simple linear SVMs in one-vs.-all fashion on L2-normalized features [7, 10] using Liblinear [37].

Figure 3: Performance comparisons between our single-scale LDS+CNN (SS-LDS+CNN) and multi-scale LDS+CNN (MS-LDS+CNN) and the pre-trained single-scale CNN (SS-CNN) and multi-scale DAG-CNN (MS-DAG-CNN) baselines for scene classification, fine-grained recognition, and action recognition from few labeled examples on four benchmark datasets. VGG19 [3] is used as the CNN model for its demonstrated superior performance. For SUN-397, we also include a publicly available strong baseline, Places-CNN, a CNN (AlexNet architecture) trained from scratch on a scene-centric database with over 7 million annotated images from 400 scene categories, which achieved state-of-the-art performance for scene classification [2]. X-axis: number of training examples per class. Y-axis: average multi-class classification accuracy. With improved transferability gained from a large set of unlabeled data, our LDS+CNNs with simple linear SVMs significantly outperform the vanilla pre-trained CNN and the powerful DAG-CNN for small-sample learning.

Type                    Approach             SUN-397   MIT-67   102 Flowers   Stanford-40
Weakly-supervised CNNs  Flickr-AlexNet          42.7     55.8          74.2          53.0
                        Flickr-GoogLeNet        44.4     55.6          65.8          52.8
                        Combined-AlexNet        47.3     58.8          83.3          56.4
                        Combined-GoogLeNet      55.0     67.9          83.7          69.2
Ours                    SS-LDS+CNN              55.4     73.6          87.5          70.5
                        MS-LDS+CNN              59.9     80.2          95.4          72.6

Table 1: Performance comparisons of classification accuracy (%) between our LDS+CNNs and weakly-supervised CNNs [28] on the four datasets when using the entire training sets. In contrast to our approach, which uses the Flickr dataset for unsupervised meta-training, Flickr-AlexNet/GoogLeNet train CNNs from scratch on the Flickr dataset using the associated captions as weak supervisory information. Combined-AlexNet/GoogLeNet concatenate features from supervised ImageNet CNNs and weakly-supervised Flickr CNNs. Despite the same amount of data used for pre-training, ours outperform the weakly-supervised CNNs by a significant margin, owing to their noisy captions and tags.

Single-Scale Features. We begin by evaluating single-scale features on these datasets. For a fair comparison, we first reduce the dimensionality of LDS+CNN from 20,000 to 4,096, the same dimensionality as CNN, followed by linear SVMs. 
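This evaluation protocol (L2-normalized off-the-shelf features, one-vs.-all linear SVMs, accuracy as a function of the number of labeled examples per class) can be sketched as follows. This is an illustrative sketch, not the paper's code: the Gaussian features merely stand in for pre-extracted fc7/LDS activations, and scikit-learn's LinearSVC (which wraps the same liblinear solver) replaces Liblinear.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

def few_shot_accuracy(train_X, train_y, test_X, test_y, n_per_class, seed=0):
    """Sample n_per_class examples per category, train one-vs.-all linear SVMs
    on L2-normalized features, and report multi-class test accuracy."""
    rng = np.random.RandomState(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(train_y == c), n_per_class, replace=False)
        for c in np.unique(train_y)
    ])
    clf = LinearSVC(C=1.0)  # one-vs.-rest linear SVMs (liblinear backend)
    clf.fit(normalize(train_X[idx]), train_y[idx])  # L2-normalize each feature
    return clf.score(normalize(test_X), test_y)

# Toy stand-in for pre-extracted activations: 4 classes in a 64-D feature space.
rng = np.random.RandomState(0)
n_classes, dim = 4, 64
means = 2.0 * rng.randn(n_classes, dim)

def sample(n_per_class):
    y = np.repeat(np.arange(n_classes), n_per_class)
    return means[y] + rng.randn(y.size, dim), y

train_X, train_y = sample(50)
test_X, test_y = sample(25)
for k in (1, 5, 20):  # accuracy generally improves with more labeled examples
    print(k, few_shot_accuracy(train_X, train_y, test_X, test_y, k))
```

Sweeping `n_per_class` reproduces the shape of the curves in Figure 3 for any fixed feature extractor; only the feature matrices change between the CNN and LDS+CNN conditions.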
This dimensionality reduction is achieved by selecting from LDS+CNN the 4,096 most active features according to the standard criterion of multi-class recursive feature elimination (RFE) [38] on the target dataset. We also tested PCA; the performance drops, but it remains significantly better than the pre-trained CNN. Figure 3 summarizes the average performance over 10 random splits on these datasets. When used as off-the-shelf features for small-sample learning, our single-scale LDS+CNN significantly outperforms the vanilla pre-trained CNN, which is already a strong baseline. The performance boost is particularly large, nearly 20% on MIT-67, in the one-shot learning scenario. This verifies the effectiveness of the layer-wise LDS, which leads to a more generic representation for a broad range of novel categories.

Figure 5: Effect of fine-tuning (FT) on SUN-397 (purple bars) and MIT-67 (blue bars). Fine-tuning LDS+CNNs (AlexNet) further improves the performance over the off-the-shelf (OTS) features for novel category recognition.

Figure 4: Illustration of learning low-density separators between successive layers on a large amount of unlabeled data. 
Note the color cor-\nrespondence between the decision boundaries\nacross the unlabeled data and the connection\nweights in the network.\nMulti-Scale Features. Given the promise of single-scale LDS+CNN, we now evaluate multi-scale\noff-the-shelf features. After learning LDS in each activation space separately, we reduce their\ndimensionality to that of the corresponding activation space via RFE for a fair comparison with DAG-\nCNN [10]. We train linear SVMs on these LDS+CNNs, and then average their predictions. Figure 3\nsummarizes the average performance over different splits for multi-scale features. Consistent with\nthe single-scale results, our multi-scale LDS+CNN outperforms the powerful multi-scale DAG-CNN.\nLDS+CNN is especially bene\ufb01cial to \ufb01ne-grained recognition, since there is typically limited data\nper class for \ufb01ne-grained categories. Figure 3 also validates that multi-scale LDS+CNN allows for\ntransfer at different levels, thus leading to better generalization to novel recognition tasks compared\nto its single-scale counterpart. In addition, Table 1 further shows that our LDS+CNNs outperform\nweakly-supervised CNNs [28] that are directly trained on Flickr using external caption information.\n4.2 Fine-tuning\n\nWith more training data available in the target task, our LDS+CNNs could be \ufb01ne-tuned to further\nimprove the performance. For ef\ufb01cient and easy \ufb01ne-tuning, we use AlexNet in this set of experiments\nas in [10]. We evaluate the effect of \ufb01ne-tuning of our single-scale and multi-scale LDS+CNNs in\nthe scene classi\ufb01cation tasks, due to their relatively large number of training samples. We compare\nagainst the \ufb01ne-tuned single-scale CNN and multi-scale DAG-CNN [10], as shown in Figure 5.\nFor completeness, we also include their off-the-shelf performance. As expected, \ufb01ne-tuned models\nconsistently outperform their off-the-shelf counterparts. 
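A common way to realize such fine-tuning is to assign the pre-trained layers a much smaller learning rate than the freshly initialized target-task classifier. A minimal PyTorch sketch of this schedule follows; the layer shapes, names, and learning rates are illustrative stand-ins, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: a pre-trained trunk, an LDS-style layer on top,
# and a fresh classifier head for the target task (e.g., 67 MIT-67 classes).
trunk = nn.Sequential(nn.Linear(256, 256), nn.ReLU())  # pre-trained layers
lds = nn.Linear(256, 512)                              # meta-trained LDS units
head = nn.Linear(512, 67)                              # new head, trained from scratch

# Per-parameter-group learning rates: small for pre-trained weights,
# larger for the randomly initialized head.
opt = torch.optim.SGD([
    {"params": trunk.parameters(), "lr": 1e-4},
    {"params": lds.parameters(), "lr": 1e-4},
    {"params": head.parameters(), "lr": 1e-2},
], momentum=0.9)

x = torch.randn(8, 256)          # a mini-batch of activation-level inputs
y = torch.randint(0, 67, (8,))   # target-task labels
loss = nn.functional.cross_entropy(head(lds(trunk(x))), y)
opt.zero_grad()
loss.backward()
opt.step()                       # one fine-tuning step updates all three parts
```

The same one-step loop, run over the target training set for several epochs, is the standard fine-tuning recipe; freezing the trunk entirely (learning rate 0) recovers the off-the-shelf setting.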
Importantly, Figure 5 shows that our approach is not limited to small-sample learning and remains effective even in the regime with many training examples.

5 Conclusions

Even though current large-scale annotated datasets are comprehensive, they are only a tiny sampling of the full visual world, biased toward a selection of categories. It is still not clear how to take advantage of truly large sets of unlabeled real-world images, which constitute a much less biased sampling of the visual world. In this work we proposed an approach to leveraging such unsupervised data sources to improve the overall transferability of supervised CNNs and thus to facilitate the recognition of novel categories from few examples. This is achieved by encouraging multiple top layer units to generate diverse sets of low-density separations across the unlabeled data in activation spaces, which decouples these units from ties to a specific set of categories. The resulting modified CNNs (single-scale and multi-scale low-density separator networks) are fairly generic to a wide spectrum of novel categories, leading to significant improvements for scene classification, fine-grained recognition, and action recognition. The specific implementation described here is a first step. While we used a particular max-margin optimization to train low-density separators, it would be interesting to integrate both learning low-density separators and gradually estimating high-density quasi-classes into the current CNN backpropagation framework.

Acknowledgments. We thank Liangyan Gui, Carl Doersch, and Deva Ramanan for valuable and insightful discussions. This work was supported in part by ONR MURI N000141612007 and the U.S. Army Research Laboratory (ARL) under the Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-2-0016. We also thank NVIDIA for donating GPUs and the AWS Cloud Credits for Research program.

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[4] D. Held, S. Thrun, and S. Savarese. Robust single-view instance recognition. In ICRA, 2016.
[5] Y.-X. Wang and M. Hebert. Model recommendation: Generating object detectors from few samples. In CVPR, 2015.
[6] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[7] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshop, 2014.
[8] H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. Factors of transferability for a generic ConvNet representation. TPAMI, 2015.
[9] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[10] S. Yang and D. Ramanan. Multi-scale recognition with DAG-CNNs. In ICCV, 2015.
[11] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Workshops, 2015.
[12] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[13] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, 2016.
[14] Y.-X. Wang and M. Hebert. Learning by transferring from unsupervised universal sources. In AAAI, 2016.
[15] Z. Li and D. Hoiem. Learning without forgetting. In ECCV, 2016.
[16] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
[17] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
[18] B. Hariharan and R. Girshick. Low-shot visual object recognition. arXiv preprint arXiv:1606.02819, 2016.
[19] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book in preparation for MIT Press, 2016.
[20] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In AISTATS, 2005.
[21] S. Ben-David, T. Lu, D. Pál, and M. Sotáková. Learning low density separators. In AISTATS, 2009.
[22] J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discovering latent domains for multisource domain adaptation. In ECCV, 2012.
[23] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[24] J. Choi, M. Rastegari, A. Farhadi, and L. S. Davis. Adding unlabeled samples to categories by learned attributes. In CVPR, 2013.
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[26] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[27] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
[28] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
[29] J. Weston, F. Ratle, H. Mobahi, and R. Collobert. Deep learning via semi-supervised embedding. In ICML, 2008.
[30] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. P. Xing. Training hierarchical feed-forward visual recognition models using transfer learning from pseudo-tasks. In ECCV, 2008.
[31] D. Dai and L. Van Gool. Ensemble projection for semi-supervised image classification. In ICCV, 2013.
[32] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[33] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva. SUN database: Exploring a large collection of scene categories. IJCV, 119(1):3–22, 2016.
[34] A. Torralba and A. Quattoni. Recognizing indoor scenes. In CVPR, 2009.
[35] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
[36] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
[37] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[38] A. Bergamo and L. Torresani. Classemes and other classifier-based features for efficient object categorization. TPAMI, 36(10):1988–2001, 2014.