{"title": "Learning Deep Parsimonious Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 5076, "page_last": 5084, "abstract": "In this paper we aim at facilitating generalization for deep networks while supporting interpretability of the learned representations. Towards this goal, we propose a clustering based regularization that encourages parsimonious representations. Our k-means style objective is easy to optimize and flexible supporting various forms of clustering, including sample and spatial clustering as well as co-clustering. We demonstrate the effectiveness of our approach on the tasks of unsupervised learning, classification, fine grained categorization and zero-shot learning.", "full_text": "Learning Deep Parsimonious Representations\n\nRenjie Liao (1), Alexander Schwing (2), Richard S. Zemel (1,3), Raquel Urtasun (1)\n\n{rjliao, zemel, urtasun}@cs.toronto.edu, aschwing@illinois.edu\n\n(1) University of Toronto\n(2) University of Illinois at Urbana-Champaign\n(3) Canadian Institute for Advanced Research\n\nAbstract\n\nIn this paper we aim at facilitating generalization for deep networks while supporting interpretability of the learned representations. Towards this goal, we propose a clustering based regularization that encourages parsimonious representations. Our k-means style objective is easy to optimize and flexible, supporting various forms of clustering, such as sample clustering, spatial clustering, as well as co-clustering. We demonstrate the effectiveness of our approach on the tasks of unsupervised learning, classification, fine grained categorization, and zero-shot learning.\n\n1 Introduction\n\nIn recent years, deep neural networks have been shown to perform extremely well on a variety of tasks including classification [21], semantic segmentation [13], machine translation [27] and speech recognition [16]. 
This has led to their adoption across many areas such as computer vision, natural language processing and robotics [16, 21, 22, 27]. Three major advances are responsible for the recent success of neural networks: the increase in available computational resources, access to large scale data sets, and several algorithmic improvements.\n\nMany of these algorithmic advances are related to regularization, which is key to preventing overfitting and improving generalization of the learned classifier, as the current trend is to increase the capacity of neural nets. For example, batch normalization [18] is used to normalize intermediate representations, which can be interpreted as imposing constraints. In contrast, dropout [26] removes a fraction of the learned representations at random to prevent co-adaptation. Learning of de-correlated activations [6] shares a similar idea since it explicitly discourages correlation between the units.\n\nIn this paper we propose a new type of regularization that encourages the network representations to form clusters. As a consequence, the learned feature space is compactly representable, facilitating generalization. Furthermore, clustering supports interpretability of the learned representations. We formulate our regularization with a k-means style objective which is easy to optimize, and investigate different types of clusterings, including sample clustering, spatial clustering, and co-clustering.\n\nWe demonstrate the generalization performance of our proposed method in several settings: autoencoders trained on the MNIST dataset [23], classification on CIFAR10 and CIFAR100 [20], as well as fine-grained classification and zero-shot learning on the CUB-200-2011 dataset [34]. We show that our approach improves generalization in all of these scenarios. 
In addition, we are able to demonstrate on the CUB-200-2011 dataset that the network representation captures meaningful part representations even though it is not explicitly trained to do so.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 Related Work\n\nStandard neural network regularization involves penalties on the weights based on the norm of the parameters [29, 30]. Also popular are regularization methods applied to intermediate representations, such as Dropout [26], Drop-Connect [32], Maxout [10] and DeCov [6]. These approaches share the aim of preventing the activations in the network from being correlated. Our work can be seen as a different form of regularization, where we encourage parsimonious representations.\n\nA variety of approaches have applied clustering to the parameters of the neural network with the aim of compressing the network. Compression rates of more than an order of magnitude were demonstrated in [11] without sacrificing accuracy. In the same spirit hash functions were exploited in [5]. Early approaches to compression include biased weight decay [12] and the methods of [14, 24], which prune the network based on the Hessian of the loss function.\n\nRecently, various combinations of clustering with representation learning have been proposed. We categorize them broadly into two areas: (i) work that applies clustering after having learned a representation, and (ii) approaches that jointly optimize the learning and clustering objectives. [4] combines deep belief networks (DBN) with non-parametric maximum-margin clustering in a post-hoc manner: A DBN is trained layer-wise to obtain an intermediate representation of the data; non-parametric maximum-margin clustering is then applied to the data representation. Another line of work utilizes an embedding of the deep network, which can be based on annotated data [15], or from a learned unsupervised method such as a stacked auto-encoder [28]. 
In these approaches, the network is trained to approximate the embedding, and subsequently either k-means or spectral clustering is performed to partition the space. An alternative is to use non-negative matrix factorization, which represents a given data matrix as the product of components [31]. This deep non-negative matrix factorization is trained using the reconstruction loss rather than a clustering objective. Nonetheless, it was shown that factors lower in the hierarchy have superior clustering performance on low-level concepts while factors later in the hierarchy cluster high-level concepts. The aforementioned approaches differ from our proposed technique, since we aim at jointly learning a representation that is parsimonious via a clustering regularization.\n\nAlso related are approaches that utilize sparse coding. Wang et al. [33] unroll the iterations forming the sparse codes and optimize the involved parameters end-to-end using a clustering objective as loss function. The proposed framework is further augmented by clustering objectives applied to intermediate representations, which act as feature regularization within the unrolled optimization. They found that features lower in the unrolled hierarchy cluster low-level concepts, while features later in the hierarchy capture high-level concepts. Our method differs in that we use convolutional neural networks rather than unrolling a sparse coding optimization.\n\nIn the context of unsupervised clustering, [35] exploited agglomerative clustering as a regularizer; this approach was formulated as a recurrent network. 
In contrast we employ a k-means like clustering objective which simplifies the optimization significantly and does not require a recurrent procedure. Furthermore, we investigate both unsupervised and supervised learning.\n\n3 Learning Deep Parsimonious Representations\n\nIn this section, we introduce our new clustering based regularization which not only encourages the neural network to learn more compact representations, but also enables interpretability of the neural network. We first show that by exploiting different unfoldings of the representation tensor, we obtain multiple types of clusterings, each possessing different properties. We then devise an efficient online update to jointly learn the clustering with the parameters of the neural network.\n\n3.1 Clustering of Representations\n\nWe first introduce some notation. We refer to $[K]$ as the set of $K$ positive integers, i.e., $[K] = \\{1, 2, ..., K\\}$. We use $S \\setminus A$ to denote the set $S$ with elements from the set $A$ removed. A tensor is a multilinear map over a set of vector spaces. In tensor terminology, n-mode vectors of a D-order tensor $Y \\in \\mathbb{R}^{I_1 \\times I_2 \\times \\cdots \\times I_D}$ are $I_n$-dimensional vectors obtained from $Y$ by varying the index in the $I_n$-dimension, while keeping all other indices fixed. An n-mode matrix unfolding of a tensor is a matrix which has all n-mode vectors as its columns [7]. Formally we use the operator $T^{\\{I_n\\} \\times \\{I_j | j \\in [D] \\setminus n\\}}$ to denote the n-mode matrix unfolding, which returns a matrix of size $I_n \\times \\prod_{j \\in [D] \\setminus n} I_j$. Similarly, we define $T^{\\{I_i, I_j\\} \\times \\{I_k | k \\in [D] \\setminus \\{i,j\\}\\}}$ to be an (i, j)-mode matrix unfolding operator. In this case a column vector is a concatenation of one i-mode vector and one j-mode vector. We denote the m-th row vector of a matrix $X$ as $X_m$.\n\nFigure 1: (A) Sample clustering and (B) spatial clustering. 
Samples, pixels, and channels are visualized as multi-channel maps, cubes, and maps in depth respectively. The receptive fields in the input image are denoted as red boxes.\n\nIn this paper we assume the representation of one layer within a neural network to be a 4-D tensor $Y \\in \\mathbb{R}^{N \\times C \\times H \\times W}$, where $N$, $C$, $H$ and $W$ are the number of samples within a mini-batch, the number of hidden units, the height and width of the representation respectively. Note that $C$, $H$ and $W$ can vary between layers, and in the case of a fully connected layer, the dimensions along height and width become a singleton and the tensor degenerates to a matrix.\n\nLet $L$ be the loss function of a neural network. In addition, we refer to the clustering regularization of a single layer via $R$. The final objective is $L + \\lambda R$, where $\\lambda$ adjusts the importance of the clustering regularization. Note that we can add a regularization term for any subset of layers, but we focus on a single layer for notational simplicity. In what follows, we show three different types of clustering, each possessing different properties. In our framework any variant can be applied to any layer.\n\n(A) Sample Clustering: We first investigate clustering along the sample dimension. Since the cluster assignments of different layers are not linked, each layer is free to cluster examples in a different way. For example, in a ConvNet, bottom layer representations may focus on low-level visual cues, such as color and edges, while top layer features may focus on high-level attributes which have a more semantic meaning. We refer the reader to Fig. 1 (a) for an illustration. In particular, given the representation tensor $Y$, we first unfold it into a matrix $T^{\\{N\\} \\times \\{H,W,C\\}}(Y) \\in \\mathbb{R}^{N \\times HWC}$. 
We then encourage the samples to cluster as follows:\n\n$$R_{\\text{sample}}(Y, \\mu) = \\frac{1}{2NCHW} \\sum_{n=1}^{N} \\left\\| T^{\\{N\\} \\times \\{H,W,C\\}}(Y)_n - \\mu_{z_n} \\right\\|^2, \\tag{1}$$\n\nwhere $\\mu$ is a matrix of size $K \\times HWC$ encoding all cluster centers, with $K$ the total number of clusters. $z_n \\in [K]$ is a discrete latent variable corresponding to the n-th sample. It indicates which cluster this sample belongs to. Note that for a fully connected layer, the formulation is the same except that $T^{\\{N\\} \\times \\{H,W,C\\}}(Y)_n$ and $\\mu_{z_n}$ are $C$-sized vectors since $H = W = 1$ in this case.\n\n(B) Spatial Clustering: The representation of one sample can be regarded as a $C$-channel \u201cimage.\u201d Each spatial location within that \u201cimage\u201d can be thought of as a \u201cpixel,\u201d and is a vector of size $C$ (shown as a colored bar in Fig. 1). For a ConvNet, every \u201cpixel\u201d has a corresponding receptive field covering a local region in the input image. Therefore, by clustering \u201cpixels\u201d of all images during learning, we expect to model local parts shared by multiple objects or scenes. To achieve this, we adopt the unfolding operator $T^{\\{N,H,W\\} \\times \\{C\\}}(Y)$ and use\n\n$$R_{\\text{spatial}}(Y, \\mu) = \\frac{1}{2NCHW} \\sum_{i=1}^{NHW} \\left\\| T^{\\{N,H,W\\} \\times \\{C\\}}(Y)_i - \\mu_{z_i} \\right\\|^2. \\tag{2}$$\n\nNote that although we use the analogy of a \u201cpixel,\u201d when using text data a \u201cpixel\u201d may correspond to a word. For spatial clustering the dimension of the matrix $\\mu$ is $K \\times C$.\n\n(C) Channel Co-Clustering: This regularizer groups the channels of different samples directly, thus co-clustering samples and filters. 
We expect this type of regularization to model re-occurring patterns shared not only among different samples but also within each sample. Relying on the unfolding operator $T^{\\{N,C\\} \\times \\{H,W\\}}(Y)$, we formulate this type of clustering objective as\n\n$$R_{\\text{channel}}(Y, \\mu) = \\frac{1}{2NCHW} \\sum_{i=1}^{NC} \\left\\| T^{\\{N,C\\} \\times \\{H,W\\}}(Y)_i - \\mu_{z_i} \\right\\|^2. \\tag{3}$$\n\nNote that the dimension of the matrix $\\mu$ is $K \\times HW$ in this case.\n\nAlgorithm 1: Learning Parsimonious Representations\nInitialization: Maximum training iteration $R$, batch size $B$, smooth weight $\\alpha$, set of clustering layers $S$ and set of cluster centers $\\{\\mu_k^0 | k \\in [K]\\}$, update period $M$\nFor iteration $t = 1, 2, ..., R$:\n    For layer $l = 1, 2, ..., L$:\n        Compute the output representation of layer $l$ as $X$.\n        If $l \\in S$:\n            Assign cluster $z_n = \\arg\\min_k \\| X_n - \\mu_k^{t-1} \\|^2, \\forall n \\in [B]$.\n            Compute cluster center $\\hat{\\mu}_k = \\frac{1}{|N_k|} \\sum_{n \\in N_k} X_n$, where $N_k = [B] \\cap \\{n | z_n = k\\}$.\n            Smooth cluster center $\\mu_k^t = \\alpha \\hat{\\mu}_k + (1 - \\alpha) \\mu_k^{t-1}$.\n        End\n    End\n    Compute the gradients with cluster centers $\\mu_k^t$ fixed.\n    Update weights.\n    Update drifted cluster centers using Kmeans++ every $M$ iterations.\nEnd\n\n3.2 Efficient Online Update\n\nWe now derive an efficient online update to jointly learn the weights while clustering the representations of the neural network. In particular, we illustrate the sample clustering case while noting that the other types can be derived easily by applying the corresponding unfolding operator. For ease of notation, we denote the unfolded matrix $T^{\\{N\\} \\times \\{H,W,C\\}}(Y)$ as $X$. The gradient of the clustering regularization layer w.r.t. 
its input representation $X$ can be expressed as,\n\n$$\\frac{\\partial R}{\\partial X_n} = \\frac{1}{NCHW} \\left[ X_n - \\mu_{z_n} - \\frac{1}{Q_{z_n}} \\sum_{z_p = z_n, \\forall p \\in [N]} \\left( X_n - \\mu_{z_p} \\right) \\right], \\tag{4}$$\n\nwhere $Q_{z_n}$ is the number of samples which belong to the $z_n$-th cluster. This gradient is then backpropagated through the network to obtain the gradient w.r.t. the parameters of the network. The time and space complexity of the gradient computation of one regularization layer are $\\max(O(KCHW), O(NCHW))$ and $O(NCHW)$ respectively. Note that we can cache the centered data $X_n - \\mu_{z_n}$ in the forward pass to speed up the gradient computation.\n\nThe overall learning algorithm of our framework is summarized in Alg. 1. In the forward pass, we first compute the representation of the n-th sample as $X_n$ for each layer. We then infer the latent cluster label $z_n$ for each sample based on the distance to the cluster centers $\\mu_k^{t-1}$ from the last time step $t-1$, and assign the sample to the cluster center which has the smallest distance. Once all the cluster assignments are computed, we estimate the cluster centers $\\hat{\\mu}_k$ based on the new labels of the current batch.\n\nWe then combine the estimate based on the current batch with the former cluster center. This is done via an online update. We found an online update together with the random restart strategy to work well in practice, as the learning of the neural network proceeds one mini-batch at a time, and as it is too expensive to recompute the cluster assignment for all data samples in every iteration. Since we trust our current cluster center estimate more than older ones, we smooth the estimation by using an exponential moving average. The cluster center estimate at iteration $t$ is obtained via $\\mu_k^t = \\alpha \\hat{\\mu}_k + (1 - \\alpha) \\mu_k^{t-1}$, where $\\alpha$ is a smoothing weight. 
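As a rough illustration of the forward-pass steps described above (cluster assignment, the k-means style penalty, and the exponential moving average of the centers), the following pure-Python sketch operates on a tiny unfolded matrix X with one row per sample. It is not the authors' TensorFlow implementation: the helper names (sq_dist, assign_clusters, clustering_penalty, ema_update), the toy data, and the choice to leave the center of an empty cluster unchanged are assumptions made for this example.

```python
# Sketch only: plain-Python version of the per-batch clustering step.
# X is a list of B row vectors (the unfolded representation),
# centers is a list of K row vectors of the same length.


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_clusters(X, centers):
    # z_n = argmin_k ||X_n - mu_k||^2 for every row of X.
    return [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
            for x in X]


def clustering_penalty(X, centers, z):
    # R = 1 / (2 * N * D) * sum_n ||X_n - mu_{z_n}||^2,
    # where D is the row length (C * H * W for sample clustering).
    n, d = len(X), len(X[0])
    total = sum(sq_dist(x, centers[k]) for x, k in zip(X, z))
    return total / (2.0 * n * d)


def ema_update(X, centers, z, alpha):
    # mu_k^t = alpha * batch_mean_k + (1 - alpha) * mu_k^{t-1}.
    # Assumption: a cluster with no assigned samples keeps its old center.
    new_centers = []
    for k, mu in enumerate(centers):
        members = [x for x, zk in zip(X, z) if zk == k]
        if not members:
            new_centers.append(list(mu))
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        new_centers.append([alpha * m + (1.0 - alpha) * c
                            for m, c in zip(mean, mu)])
    return new_centers


# Toy batch: three 2-D samples and two cluster centers (hypothetical values).
X = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
centers = [[0.0, 0.0], [8.0, 8.0]]
z = assign_clusters(X, centers)
penalty = clustering_penalty(X, centers, z)
centers = ema_update(X, centers, z, alpha=0.5)
```

For this toy batch the assignments are [0, 0, 1], the penalty evaluates to 1.0, and the smoothed centers become [0.0, 0.5] and [9.0, 9.0]. In a real network the same three steps would run on the unfolded activations of each regularized layer, with the centers held fixed during the backward pass.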
However, as the representation learned by the neural network may go through drastic changes, especially in the beginning of training, some cluster centers may quickly fall out of favor, so that the number of incoming samples assigned to them drops sharply. To overcome this issue, we exploit the Kmeans++ [3] procedure to re-sample the cluster center from the current mini-batch. Specifically, denoting the distance between sample $X_n$ and its nearest cluster center as $d_n$, the probability of taking $X_n$ as the new cluster center is $d_n^2 / \\sum_i d_i^2$. After sampling, we replace the old cluster center with the new one and continue the learning process.\n\nTable 1: Autoencoder Experiments on MNIST. We report the average of mean reconstruction error over 4 trials and the corresponding standard deviation.\n\nMeasurement | Train | Test\nAE | 2.69 \u00b1 0.12 | 3.61 \u00b1 0.13\nAE + Sample-Clustering | 2.73 \u00b1 0.01 | 3.50 \u00b1 0.01\n\nTable 2: CIFAR10 and CIFAR100 results. For DeCov, no standard deviation is provided for the CIFAR100 results [6]. All our approaches outperform the baselines.\n\nDataset | CIFAR10 Train | CIFAR10 Test | CIFAR100 Train | CIFAR100 Test\nCaffe | 94.87 \u00b1 0.14 | 76.32 \u00b1 0.17 | 68.01 \u00b1 0.64 | 46.21 \u00b1 0.34\nWeight Decay | 95.34 \u00b1 0.27 | 76.79 \u00b1 0.31 | 69.32 \u00b1 0.51 | 46.93 \u00b1 0.42\nDeCov | 88.78 \u00b1 0.23 | 79.72 \u00b1 0.14 | 77.92 | 40.34\nDropout | 99.10 \u00b1 0.17 | 77.45 \u00b1 0.21 | 60.77 \u00b1 0.47 | 48.70 \u00b1 0.38\nSample-Clustering | 89.93 \u00b1 0.19 | 81.05 \u00b1 0.41 | 63.60 \u00b1 0.55 | 50.50 \u00b1 0.38\nSpatial-Clustering | 90.50 \u00b1 0.05 | 81.02 \u00b1 0.12 | 64.38 \u00b1 0.38 | 50.18 \u00b1 0.49\nChannel Co-Clustering | 89.26 \u00b1 0.25 | 80.65 \u00b1 0.23 | 63.42 \u00b1 1.34 | 49.80 \u00b1 0.25\n\n
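The Kmeans++ style re-sampling of a drifted center can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the released code: the function name reseed_center, the argument stale_k (the index of the center to replace), the toy data, and the uniform fallback when every sample coincides with a center are all hypothetical choices.

```python
import random


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def reseed_center(X, centers, stale_k, rng):
    # d_n^2: squared distance from each sample to its nearest current center.
    d2 = [min(sq_dist(x, c) for c in centers) for x in X]
    if sum(d2) == 0.0:
        # Assumption: if all distances are zero, fall back to a uniform draw.
        idx = rng.randrange(len(X))
    else:
        # Draw sample n with probability d_n^2 / sum_i d_i^2.
        idx = rng.choices(range(len(X)), weights=d2, k=1)[0]
    # Return a copy of the centers with the stale one replaced by sample idx.
    new_centers = [list(c) for c in centers]
    new_centers[stale_k] = list(X[idx])
    return new_centers, idx


# Toy mini-batch (hypothetical values): two samples sit on existing centers,
# so only the third sample has non-zero sampling probability.
rng = random.Random(0)
X = [[0.0, 0.0], [1.0, 0.0], [9.0, 9.0]]
centers = [[0.0, 0.0], [1.0, 0.0]]
new_centers, picked = reseed_center(X, centers, stale_k=1, rng=rng)
```

Because samples far from every existing center receive most of the probability mass, the replaced center tends to land in a region of the feature space that is currently poorly covered, which is the diversity effect the text attributes to this step.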
In practice, at the end of every epoch, we apply the kmeans++ update to cluster centers for which the number of assigned samples is small. See Alg. 1 for an outline of the steps. The overall procedure stabilizes the optimization and also increases the diversity of the cluster centers. In the backward pass, we fix the latest estimation of the cluster centers $\\mu_k^t$ and compute the gradient of the loss function and the gradient of the clustering objective based on Eq. (4). Then we back-propagate all the gradients and update the weights.\n\n4 Experiments\n\nIn this section, we conduct experiments on unsupervised, supervised and zero-shot learning on several datasets. Our implementation based on TensorFlow [9] is publicly available (see footnote 1). For initializing the cluster centers before training, we randomly choose them from the representations obtained with the initial network.\n\n4.1 Autoencoder on MNIST\n\nWe first test our method on the unsupervised learning task of training an autoencoder. Our architecture is identical to [17]. For ease of training we did not tie the weights between the encoder and the decoder. We use the squared $\\ell_2$ reconstruction error as the loss function and SGD with momentum. The standard training-test-split is used. We compute the mean reconstruction error over all test images and repeat the experiments 4 times with different random initializations. We compare the baseline model, i.e., a plain autoencoder, with one that employs our sample-clustering regularization on all layers except the top fully connected layer. Sample clustering was chosen since this autoencoder only contains fully connected layers. The number of clusters and the regularization weight $\\lambda$ of all layers are set to 100 and 1.0e-2 respectively. For both models the same learning rate and momentum are used. Our exact parameter choices are detailed in the Appendix. 
As shown in Table 1, our regularization facilitates generalization as it suffers less from overfitting. Specifically, applying our regularization results in lower test set error despite slightly higher training error. More importantly, the standard deviation of the error is one order of magnitude smaller for both training and testing when applying our regularization. This indicates that our sample-clustering regularization stabilizes the model.\n\nFootnote 1: https://github.com/lrjconan/deep_parsimonious\n\nFigure 2: Visualization of clusterings on CIFAR10 dataset. Rows 1, 2 each show examples belonging to a single sample-cluster; rows 3, 4 show regions clustered via spatial clustering. (Row labels: FC-4, FC-4, Conv-2, Conv-2.)\n\n4.2 CIFAR10 and CIFAR100\n\nIn this section, we explore the CIFAR10 and CIFAR100 datasets [20]. CIFAR10 consists of 60,000 32 \u00d7 32 images assigned to 10 categories, while CIFAR100 differentiates between 100 classes. We use the standard split on both datasets. The quick CIFAR10 architecture of Caffe [19] is used for benchmarking both datasets. It consists of 3 convolutional layers and 1 fully connected layer followed by a softmax layer. The detailed parameters are publicly available on the Caffe [19] website. We report mean accuracy averaged over 4 trials. For fully connected layers we use the sample-clustering objective. For convolutional layers, we provide the results of all three clustering objectives, which we refer to as \u2018sample-clustering,\u2019 \u2018spatial-clustering,\u2019 and \u2018channel-co-clustering\u2019 respectively. We set all hyper-parameters based on cross-validation. Specifically, the number of cluster centers is set to 100 for all layers for both CIFAR10 and CIFAR100. 
$\\lambda$ is set to 1.0e-3 and 1.0e-2 for the first two convolutional and the remaining layers respectively in CIFAR10; for CIFAR100, $\\lambda$ is set to 10 and 1 for the first convolutional layer and the remaining layers respectively. The smoothness parameter $\\alpha$ is set to 0.9 and 0.95 for CIFAR10 and CIFAR100 respectively.\n\nGeneralization: In Table 2 we compare our framework to some recent regularizers, like DeCov [6], Dropout [26] and the baseline results obtained using Caffe. We again observe that all of our methods achieve better generalization performance.\n\nVisualization: To demonstrate the interpretability of our learned network, we visualize sample-clustering and spatial-clustering in Fig. 2, showing the top-10 ranked images and parts per cluster. In the case of sample-clustering, for each cluster we rank all its assigned images based on the distance to the cluster center. We chose to show 2 clusters from the 4th fully connected layer. In the case of spatial-clustering, we rank all \u201cpixels\u201d belonging to one cluster based on the distance to the cluster center. Note that we have one part (i.e., one receptive field region in the input image) for each \u201cpixel.\u201d We chose to show 2 clusters from the 2nd convolutional layer. The receptive field of the 2nd convolutional layer is of size 18 \u00d7 18 in the original 32 \u00d7 32 sized image. We observe that clusterings of the fully connected layer representations encode high-level semantic meaning. In contrast, clusterings of the convolutional layer representations encode attributes like shape. Note that some parts are uninformative which may be due to the fact that images in CIFAR10 are very small. Additional clusters and visualizations on CIFAR100 are shown in the Appendix.\n\nQuantitative Evaluation of Parsimonious Representation: We quantitatively evaluate our learned parsimonious representation on CIFAR100. 
Since only the image category is provided as ground truth, we investigate sample clustering using the 4th fully connected layer where representations capture semantic meaning. In particular, we apply K-means clustering to the learned representation extracted from the model with and without sample clustering respectively. For both cases, we set the number of clusters to be 100 and control the random seed to be the same. The most frequent class label within one cluster is assigned to all of its members. Then we compute the normalized mutual information (NMI) [25] to measure the clustering accuracy. The average results over 10 runs are shown in Table 3. Our representations achieve significantly better clustering quality compared to the baseline, which suggests that they are distributed in a more compact way in the feature space.\n\nTable 3: Normalized mutual information of sample clustering on CIFAR100.\n\nMethod | NMI\nBaseline | 0.4122 \u00b1 0.0012\nSample-Clustering | 0.4914 \u00b1 0.0011\n\nTable 4: Classification accuracy on CUB-200-2011.\n\nMethod | Train | Test\nDeCAF [8] | - | 58.75\nSample-Clustering | 100.0 | 61.77\nSpatial-Clustering | 100.0 | 61.67\nChannel Co-Clustering | 100.0 | 61.49\n\n4.3 CUB-200-2011\n\nNext we test our framework on the Caltech-UCSD birds dataset [34] which contains 11,788 images of 200 different categories. We follow the dataset split provided by [34] and the common practice of cropping the image using the ground-truth bounding box annotation of the birds [8, 36]. We use Alex-Net [21] pretrained on ImageNet as the base model and adapt the last layer to fit classification of 200 categories. We resize the image to 227 \u00d7 227 to fit the input size. We add clusterings to all layers except the softmax-layer. Based on cross-validation, the number of clusters is set to 200 for all layers. 
For convolutional layers, we set $\\lambda$ to 1.0e-5 for the first (bottom) 2 and use 1.0e-4 for the remaining ones. For fully connected layers, we set $\\lambda$ to 1.0e-3 and $\\alpha$ is equal to 0.5. We apply Kmeans++ to replace cluster centers with fewer than 10 assigned samples at the end of every epoch.\n\nGeneralization: We investigate the impact of our parsimonious representation on generalization performance. We compare with the DeCAF result reported in [8], which used the same network to extract a representation and applied logistic regression on top for fine-tuning. We also fine-tune Alex-Net which uses weight-decay and Dropout, and report the best result we achieved in Table 4. We observe that for the Alex-Net architecture our clustering improves the generalization compared to direct fine-tuning and the DeCAF result. Note that Alex-Net pretrained on ImageNet easily overfits on this dataset as all training accuracies reach 100 percent.\n\nVisualization: To visualize the sample-clustering and spatial-clustering we follow the setting employed when evaluating on the CIFAR dataset. For the selected cluster center we show the 10 closest images in Fig. 3. For sample clustering, 2 clusters from the 3rd convolutional layer and the 7th fully connected layer are chosen for visualization. For spatial clustering, 2 clusters from the 2nd and 3rd convolutional layers are chosen for visualization. More clusters are shown in the Appendix. The receptive fields of pixels from the 2nd and 3rd convolutional layers are of sizes 59 \u00d7 59 and 123 \u00d7 123 in the resized 227 \u00d7 227 image. We observe that cluster centers of sample clustering applied to layers lower in the network capture pose and shape information, while cluster centers from top layers model the fine-grained categories of birds. 
For spatial clustering, cluster centers from different layers capture parts of birds in different scales, like the beak, chest, etc.\n\nFigure 3: Visualization of sample and pixel clustering on CUB-200-2011 dataset. Row 1-4 and 5-8 show sample and spatial clusters respectively. Receptive fields are truncated to fit images. (Row labels: Conv-3, Conv-3, FC-7, FC-7, Conv-2, Conv-2, Conv-3, Conv-3.)\n\n4.4 Zero-Shot Learning\n\nWe also investigate a zero-shot setting on the CUB dataset to see whether our parsimonious representation is applicable to unseen categories. We follow the setting in [1, 2] and use the same split where 100, 50 and 50 classes are used as training, validation and testing (unseen classes). We use a pre-trained Alex-Net as the baseline model and extract 4096-dimension representations from the 7th fully connected (fc) layer. We compare sample-clustering against other recent methods which also report results of using the 7th fc feature of Alex-Net. Given these features, we learn the output embedding $W$ via the same unregularized structured SVM as in [1, 2]:\n\n$$\\min_W \\frac{1}{N} \\sum_{n=1}^{N} \\max_{y \\in \\mathcal{Y}} \\left\\{ 0, \\Delta(y_n, y) + x_n^{\\top} W [\\phi(y) - \\phi(y_n)] \\right\\}, \\tag{5}$$\n\nwhere $x_n$ and $y_n$ are the feature and class label of the n-th sample and $\\Delta$ is the 0-1 loss function. $\\phi$ is the class-attribute matrix provided by the CUB dataset, where each entry is a real-valued score indicating how likely a human thinks one attribute is present in a given class. We tune the hyper-parameters on the validation set and report results in terms of top-1 accuracy averaged over the unseen classes.\n\nTable 5: Zero-shot learning on CUB-200-2011.\n\nMethod | Top1 Accuracy\nALE [1] | 26.9\nSJE [2] | 40.3\nSample-Clustering | 46.1\n\n
As shown in Table 5, our approach reaches 46.1 top-1 accuracy, compared with 40.3 for SJE [2] and 26.9 for ALE [1].\n\n5 Conclusions\n\nWe have proposed a novel clustering based regularization which encourages parsimonious representations, while being easy to optimize. We have demonstrated the effectiveness of our approach on a variety of tasks including unsupervised learning, classification, fine grained categorization, and zero-shot learning. In the future we plan to apply our approach to even larger networks, e.g., residual nets, and develop a probabilistic formulation which provides a soft clustering.\n\nAcknowledgments\n\nThis work was partially supported by ONR-N00014-14-1-0232, NVIDIA and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.\n\nReferences\n\n[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In Proc. CVPR, 2013.\n[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proc. CVPR, 2015.\n[3] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proc. SODA, 2007.\n[4] G. Chen. Deep learning with nonparametric clustering. arXiv preprint arXiv:1501.03084, 2015.\n[5] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. arXiv preprint arXiv:1504.04788, 2015.\n[6] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra. Reducing Overfitting in Deep Networks by Decorrelating Representations. In Proc. ICLR, 2016.\n[7] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 2000.\n[8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.\n[9] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.\n[10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout Networks. In Proc. ICLR, 2013.\n[11] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proc. ICLR, 2016.\n[12] S. J. Hanson and L. Y. Pratt. Comparing biases for minimal network construction with back-propagation. In Proc. NIPS, 1989.\n[13] B. Hariharan, P. Arbel\u00e1ez, R. Girshick, and J. Malik. Simultaneous Detection and Segmentation. In Proc. ECCV, 2014.\n[14] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Proc. NIPS, 1993.\n[15] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. arXiv preprint arXiv:1508.04306, 2015.\n[16] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.\n[17] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.\n[18] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. ICML, 2015.\n[19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n[20] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.\n[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. NIPS, 2012.\n[22] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.\n[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of IEEE, 1998.\n[24] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal Brain Damage. In Proc. NIPS, 1989.\n[25] C. D. Manning, P. Raghavan, and H. Sch\u00fctze. Introduction to information retrieval, 2008.\n[26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.\n[27] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proc. NIPS, 2014.\n[28] F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In Proc. AAAI, 2014.\n[29] R. Tibshirani. Regression shrinkage and selection via the lasso. J. of the Royal Statistical Society, 1996.\n[30] A. N. Tikhonov. On the stability of inverse problems. USSR Academy of Sciences, 1943.\n[31] G. Trigeorgis, K. Bousmalis, S. Zafeiriou, and B. Schuller. A deep semi-nmf model for learning hidden representations. In Proc. ICML, 2014.\n[32] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using dropconnect. In Proc. ICML, 2013.\n[33] Z. Wang, S. Chang, J. Zhou, and T. S. Huang. Learning a task-specific deep architecture for clustering. arXiv preprint arXiv:1509.00151, 2015.\n[34] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.\n[35] J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. arXiv preprint arXiv:1604.03628, 2016.\n[36] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In Proc. ECCV, 2014.\n", "award": [], "sourceid": 2271, "authors": [{"given_name": "Renjie", "family_name": "Liao", "institution": "UofT"}, {"given_name": "Alex", "family_name": "Schwing", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Richard", "family_name": "Zemel", "institution": "University of Toronto"}, {"given_name": "Raquel", "family_name": "Urtasun", "institution": "University of Toronto"}]}