{"title": "Selecting Receptive Fields in Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2528, "page_last": 2536, "abstract": "Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters usually grows quadratically in the width of the network, thus necessitating hand-coded \"local receptive fields\" that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Specifically, we choose local receptive fields that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive fields (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive fields by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively.",
"full_text": "Selecting Receptive Fields in Deep Networks\n\nAdam Coates\nDepartment of Computer Science\nStanford University\nStanford, CA 94305\nacoates@cs.stanford.edu\n\nAndrew Y. Ng\nDepartment of Computer Science\nStanford University\nStanford, CA 94305\nang@cs.stanford.edu\n\nAbstract\n\nRecent deep learning and unsupervised feature learning systems that learn from\nunlabeled data have achieved high performance in benchmarks by using extremely\nlarge architectures with many features (hidden units) at each layer. Unfortunately,\nfor such large architectures the number of parameters can grow quadratically in the\nwidth of the network, thus necessitating hand-coded \u201clocal receptive \ufb01elds\u201d that\nlimit the number of connections from lower level features to higher ones (e.g.,\nbased on spatial locality). In this paper we propose a fast method to choose these\nconnections that may be incorporated into a wide variety of unsupervised training\nmethods. Speci\ufb01cally, we choose local receptive \ufb01elds that group together those\nlow-level features that are most similar to each other according to a pairwise simi-\nlarity metric. This approach allows us to harness the advantages of local receptive\n\ufb01elds (such as improved scalability, and reduced data requirements) when we do\nnot know how to specify such receptive \ufb01elds by hand or where our unsupervised\ntraining algorithm has no obvious generalization to a topographic setting. We\nproduce results showing how this method allows us to use even simple unsuper-\nvised training algorithms to train successful multi-layered networks that achieve\nstate-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy,\nrespectively.\n\n1 Introduction\n\nMuch recent research has focused on training deep, multi-layered networks of feature extractors ap-\nplied to challenging visual tasks like object recognition. An important practical concern in building\nsuch networks is to specify how the features in each layer connect to the features in the layers be-\nneath. 
Traditionally, the number of parameters in networks for visual tasks is reduced by restricting\nhigher level units to receive inputs only from a \u201creceptive \ufb01eld\u201d of lower-level inputs. For instance,\nin the \ufb01rst layer of a network used for object recognition it is common to connect each feature extrac-\ntor to a small rectangular area within a larger image instead of connecting every feature to the entire\nimage [14, 15]. This trick dramatically reduces the number of parameters that must be trained and\nis a key element of several state-of-the-art systems [4, 19, 6]. In this paper, we propose a method to\nautomatically choose such receptive \ufb01elds in situations where we do not know how to specify them\nby hand\u2014a situation that, as we will explain, is commonly encountered in deep networks.\nThere are now many results in the literature indicating that large networks with thousands of unique\nfeature extractors are top competitors in applications and benchmarks (e.g., [4, 6, 9, 19]). A ma-\njor obstacle to scaling up these representations further is the blowup in the number of network\nparameters: for n input features, a complete representation with n features requires a matrix of\nn^2 weights\u2014one weight for every feature and input. This blowup leads to a number of practical\nproblems: (i) it becomes dif\ufb01cult to represent, and even more dif\ufb01cult to update, the entire weight\nmatrix during learning, (ii) feature extraction becomes extremely slow, and (iii) many algorithms\nand techniques (like whitening and local contrast normalization) are dif\ufb01cult to generalize to large,\nunstructured input domains. As mentioned above, we can solve this problem by limiting the \u201cfan\nin\u201d to each feature by connecting each feature extractor to a small receptive \ufb01eld of inputs. 
In this\nwork, we will propose a method that chooses these receptive \ufb01elds automatically during unsuper-\nvised training of deep networks. The scheme can operate without prior knowledge of the underlying\ndata and is applicable to virtually any unsupervised feature learning or pre-training pipeline. In our\nexperiments, we will show that when this method is combined with a recently proposed learning\nsystem, we can construct highly scalable architectures that achieve accuracy on CIFAR-10 and STL\ndatasets beyond the best previously published.\nIt may not be clear yet why it is necessary to have an automated way to choose receptive \ufb01elds since,\nafter all, it is already common practice to pick receptive \ufb01elds simply based on prior knowledge.\nHowever, this type of solution is insuf\ufb01cient for large, deep representations. For instance, in local\nreceptive \ufb01eld architectures for image data, we typically train a bank of linear \ufb01lters that apply only\nto a small image patch. These \ufb01lters are then convolved with the input image to yield the \ufb01rst\nlayer of features. As an example, if we train 100 5-by-5 pixel \ufb01lters and convolve them with a\n32-by-32 pixel input, then we will get a 28-by-28-by-100 array of features. Each 2D grid of 28-by-\n28 feature responses for a single \ufb01lter is frequently called a \u201cmap\u201d [14, 4]. Though there are still\nspatial relationships amongst the feature values within each map, it is not clear how two features in\ndifferent maps are related. Thus when we train a second layer of features we must typically resort to\nconnecting each feature to every input map or to a random subset of maps [12, 4] (though we may\nstill take advantage of the remaining spatial organization within each map). 
At even higher layers of\ndeep networks, this problem becomes extreme: our array of responses will have very small spatial\nresolution (e.g., 1-by-1) yet will have a large number of maps and thus we can no longer make use of\nspatial receptive \ufb01elds. This problem is exacerbated further when we try to use very large numbers\nof maps which are often necessary to achieve top performance [4, 5].\nIn this work we propose a way to address the problem of choosing receptive \ufb01elds that is not only\na \ufb02exible addition to unsupervised learning and pre-training pipelines, but that can scale up to the\nextremely large networks used in state-of-the-art systems. In our method we select local receptive\n\ufb01elds that group together (pre-trained) lower-level features according to a pairwise similarity metric\nbetween features. Each receptive \ufb01eld is constructed using a greedy selection scheme so that it\ncontains features that are similar according to the similarity metric. Depending on the choice of\nmetric, we can cause our system to choose receptive \ufb01elds that are similar to those that might be\nlearned implicitly by popular learning algorithms like ICA [11]. Given the learned receptive \ufb01elds\n(groups of features) we can subsequently apply an unsupervised learning method independently\nover each receptive \ufb01eld. Thus, this method frees us to use any unsupervised learning algorithm\nto train the weights of the next layer. Using our method in conjunction with the pipeline proposed\nby [6], we demonstrate the ability to train multi-layered networks using only vector quantization as\nour unsupervised learning module. All of our results are achieved without supervised \ufb01ne-tuning\n(i.e., backpropagation), and thus rely heavily on the success of the unsupervised learning procedure.\nNevertheless, we attain the best known performances on the CIFAR-10 and STL datasets.\nWe will now discuss some additional work related to our approach in Section 2. 
Details of our\nmethod are given in Section 3 followed by our experimental results in Section 4.\n\n2 Related Work\n\nWhile much work has focused on different representations for deep networks, an orthogonal line of\nwork has investigated the effect of network structure on performance of these systems. Much of this\nline of inquiry has sought to identify the best choices of network parameters such as size, activation\nfunction, pooling method and so on [12, 5, 3, 16, 19]. Through these investigations a handful of\nkey factors have been identi\ufb01ed that strongly in\ufb02uence performance (such as the type of pooling,\nactivation function, and number of features). These works, however, do not address the \ufb01ner-grained\nquestions of how to choose the internal structure of deep networks directly.\nOther authors have tackled the problem of architecture selection more generally. One approach is\nto search for the best architecture. For instance, Saxe et al. [18] propose using randomly initialized\nnetworks (forgoing the expense of training) to search for a high-performing structure. Pinto et\nal. [17], on the other hand, use a screening procedure to choose from amongst large numbers of\nrandomly composed networks, collecting the best performing networks.\nMore powerful modeling and optimization techniques have also been used for learning the struc-\nture of deep networks in-situ. For instance, Adams et al. [1] use a non-parametric Bayesian prior\nto jointly infer the depth and number of hidden units at each layer of a deep belief network during\ntraining. Zhang and Chan [21] use an L1 penalization scheme to zero out many of the connections in\nan otherwise bipartite structure. 
Unfortunately, these methods require optimizations that are as com-\nplex or expensive as the algorithms they augment, thus making it dif\ufb01cult to achieve computational\ngains from any architectural knowledge discovered by these systems.\nIn this work, the receptive \ufb01elds will be built by analyzing the relationships between feature re-\nsponses rather than relying on prior knowledge of their organization. A popular alternative solution\nis to impose topographic organization on the feature outputs during training. In general, these learn-\ning algorithms train a set of features (usually linear \ufb01lters) such that features nearby in a pre-speci\ufb01ed\ntopography share certain characteristics. The Topographic ICA algorithm [10], for instance, uses a\nprobabilistic model that implies that nearby features in the topography have correlated variances\n(i.e., energies). This statistical measure of similarity is motivated by empirical observations of\nneurons and has been used in other analytical models [20]. Similar methods can be obtained by\nimposing group sparsity constraints so that features within a group tend to be on or off at the same\ntime [7, 8]. These methods have many advantages but require us to specify a topography \ufb01rst, then\nsolve a large-scale optimization problem in order to organize our features according to the given\ntopographic layout. This will typically involve many epochs of training and repeated feature eval-\nuations in order to succeed. In this work, we perform this procedure in reverse: our features are\npre-trained using whatever method we like, then we will extract a useful grouping of the features\npost-hoc. This approach has the advantage that it can be scaled to large distributed clusters and is\nvery generic, allowing us to potentially use different types of grouping criteria and learning strate-\ngies in the future with few changes. 
In that respect, part of the novelty in our approach is to convert\nexisting notions of topography and statistical dependence in deep networks into a highly scalable\n\u201cwrapper method\u201d that can be re-used with other algorithms.\n\n3 Algorithm Details\n\nIn this section we will describe our approach to selecting the connections between high-level fea-\ntures and their lower-level inputs (i.e., how to \u201clearn\u201d the receptive \ufb01eld structure of the high-level\nfeatures) from an arbitrary set of data based on a particular pairwise similarity metric: square-\ncorrelation of feature responses.1 We will then explain how our method integrates with a typical\nlearning pipeline and, in particular, how to couple our algorithm with the feature learning system\nproposed in [6], which we adopt since it has been shown previously to perform well on image recog-\nnition tasks.\nIn what follows, we assume that we are given a dataset X of feature vectors x(i), i \u2208 {1, . . . , m},\nwith elements x(i)\nj . These vectors may be raw features (e.g., pixel values) but will usually be features\ngenerated by lower layers of a deep network.\n\n3.1 Similarity of Features\n\nIn order to group features together, we must \ufb01rst de\ufb01ne a similarity metric between features. Ideally,\nwe should group together features that are closely related (e.g., because they respond to similar\npatterns or tend to appear together). By putting such features in the same receptive \ufb01eld, we allow\ntheir relationship to be modeled more \ufb01nely by higher level learning algorithms. Meanwhile, it also\nmakes sense to model seemingly independent subsets of features separately, and thus we would like\nsuch features to end up in different receptive \ufb01elds.\nA number of criteria might be used to quantify this type of relationship between features. 
One\npopular choice is \u201csquare correlation\u201d of feature responses, which partly underpins the Topographic\nICA [10] algorithm. The idea is that if our dataset X consists of linearly uncorrelated features (as\ncan be obtained by applying a whitening procedure), then a measure of the higher-order dependence\nbetween two features can be obtained by looking at the correlation of their energies (squared re-\nsponses). In particular, if we have $E[x] = 0$ and $E[xx^\\top] = I$, then we will de\ufb01ne the similarity\nbetween features xj and xk as the correlation between the squared responses:\n\n$$S[x_j, x_k] = \\mathrm{corr}(x_j^2, x_k^2) = \\frac{E[x_j^2 x_k^2 - 1]}{\\sqrt{E[x_j^4 - 1]\\, E[x_k^4 - 1]}}.$$\n\n1 Though we use this metric throughout, and propose some extensions, it can be replaced by many other\nchoices such as the mutual information between two features.\n\nThis metric is easy to compute by \ufb01rst whitening our input dataset with ZCA2 whitening [2], then\ncomputing the pairwise similarities between all of the features:\n\n$$S_{j,k} \\equiv S_X[x_j, x_k] \\equiv \\frac{\\sum_i (x_j^{(i)})^2 (x_k^{(i)})^2 - 1}{\\sqrt{\\sum_i ((x_j^{(i)})^4 - 1) \\sum_i ((x_k^{(i)})^4 - 1)}}. \\qquad (1)$$\n\nThis computation is completely practical for fewer than 5000 input features. For fewer than 10000\nfeatures it is feasible but somewhat arduous: we must not only hold a 10000x10000 matrix in mem-\nory but we must also whiten our 10000-feature dataset\u2014requiring a singular value or eigenvalue\ndecomposition. We will explain how this expense can be avoided in Section 3.3, after we describe\nour receptive \ufb01eld learning procedure.\n\n3.2 Selecting Local Receptive Fields\n\nWe now assume that we have available to us the matrix of pairwise similarities between features Sj,k\ncomputed as above. Our goal is to construct \u201creceptive \ufb01elds\u201d: sets of features Rn, n = 1, . . . 
, N\nwhose responses will become the inputs to one or more higher-level features. We would like for\neach Rn to contain pairs of features with large values of Sj,k. We might achieve this using various\nagglomerative or spectral clustering methods, but we have found that a simple greedy procedure\nworks well: we choose one feature as a seed, and then group it with its nearest neighbors according\nto the similarities Sj,k. In detail, we \ufb01rst select N rows, j1, . . . , jN , of the matrix S at random\n(corresponding to a random choice of features xjn to be the seed of each group). We then construct\na receptive \ufb01eld Rn that contains the features xk corresponding to the top T values of Sjn,k. We\ntypically use T = 200, though our results are not too sensitive to this parameter. Upon completion,\nwe have N (possibly overlapping) receptive \ufb01elds Rn that can be used during training of the next\nlayer of features.\n\n3.3 Approximate Similarity\n\nComputing the similarity matrix Sj,k using square correlation is practical for fairly large numbers\nof features using the obvious procedure given above. However, if we want to learn receptive \ufb01elds\nover huge numbers of features (as arise, for instance, when we use hundreds or thousands of maps),\nwe may often be unable to compute S directly. For instance, as explained above, if we use square\ncorrelation as our similarity criterion then we must perform whitening over a large number of fea-\ntures.\nNote, however, that the greedy grouping scheme we use requires only N rows of the matrix. Thus,\nprovided we can compute Sj,k for a single pair of features, we can avoid storing the entire matrix\nS. To avoid performing the whitening step for all of the input features, we can instead perform\npair-wise whitening between features. 
Speci\ufb01cally, to compute the squared correlation of xj and\nxk, we whiten the jth and kth columns of X together (independently of all other columns), then\ncompute the square correlation between the whitened values $\\hat{x}_j$ and $\\hat{x}_k$. Though this procedure is\nnot equivalent to performing full whitening, it appears to yield effective estimates for the squared\ncorrelation between two features in practice. For instance, for a given \u201cseed\u201d, the receptive \ufb01eld\nchosen using this approximation typically overlaps with the \u201ctrue\u201d receptive \ufb01eld (computed with\nfull whitening) by 70% or more. More importantly, our \ufb01nal results (Section 4) are unchanged\ncompared to the exact procedure.\n\n2 If $E[xx^\\top] = \\Sigma = V D V^\\top$, ZCA whitening uses the transform $P = V D^{-1/2} V^\\top$ to compute the\nwhitened vector $\\hat{x}$ as $\\hat{x} = P x$.\n\nCompared to the \u201cbrute force\u201d computation of the similarity matrix, the approximation described\nabove is very fast and easy to distribute across a cluster of machines. Speci\ufb01cally, the 2x2 ZCA\nwhitening transform for a pair of features can be computed analytically, and thus we can express\nthe pair-wise square correlations analytically as a function of the original inputs without having to\nnumerically perform the whitening on all pairs of features. If we assume that all of the input features\nof x(i) are zero-mean and unit variance, then we have:\n\n$$\\hat{x}_j^{(i)} = \\frac{1}{2}\\left((\\gamma_{jk} + \\beta_{jk})\\, x_j^{(i)} + (\\gamma_{jk} - \\beta_{jk})\\, x_k^{(i)}\\right)$$\n$$\\hat{x}_k^{(i)} = \\frac{1}{2}\\left((\\gamma_{jk} - \\beta_{jk})\\, x_j^{(i)} + (\\gamma_{jk} + \\beta_{jk})\\, x_k^{(i)}\\right)$$\n\nwhere $\\beta_{jk} = (1 - \\alpha_{jk})^{-1/2}$, $\\gamma_{jk} = (1 + \\alpha_{jk})^{-1/2}$ and $\\alpha_{jk}$ is the correlation between xj and xk.\nSubstituting $\\hat{x}^{(i)}$ for $x^{(i)}$ in Equation 1 and expanding yields an expression for the similarity Sj,k\nin terms of the pair-wise moments of each feature (up to fourth order). 
We can typically implement\nthese computations in a single pass over the dataset that accumulates the needed statistics and then\nselects the receptive \ufb01elds based on the results. Many alternative methods (e.g., Topographic ICA)\nwould require some form of distributed optimization algorithm to achieve a similar result, which\nrequires many feed-forward and feed-back passes over the dataset. In contrast, the above method is\ntypically less expensive than a single feed-forward pass (to compute the feature values x(i)) and is\nthus very fast compared to other conceivable solutions.\n\n3.4 Learning Architecture\n\nWe have adopted the architecture of [6], which has previously been applied with success to image\nrecognition problems. In this section we will brie\ufb02y review this system as it is used in conjunction\nwith our receptive \ufb01eld learning approach, but it should be noted that our basic method is equally\napplicable to many other choices of processing pipeline and unsupervised learning method.\nThe architecture proposed by [6] works by constructing a feature representation of a small image\npatch (say, a 6-by-6 pixel region) and then extracting these features from many overlapping patches\nwithin a larger image (much like a convolutional neural net).\nLet X \u2208 Rm\u00d7108 be a dataset composed of a large number of 3-channel (RGB), 6-by-6 pixel image\npatches extracted from random locations in unlabeled training images and let x(i) \u2208 R108 be the\nvector of RGB pixel values representing the ith patch. Then the system in [6] applies the following\nprocedure to learn a new representation of an image patch:\n\n1. Normalize each example x(i) by subtracting out the mean and dividing by the norm. Apply\n\na ZCA whitening transform to x(i) to yield \u02c6x(i).\n\n2. Apply an unsupervised learning algorithm (e.g., K-means or sparse coding) to obtain a\n\n(normalized) set of linear \ufb01lters (a \u201cdictionary\u201d), D.\n\n3. 
De\ufb01ne a mapping from the whitened input vectors $\\hat{x}^{(i)}$ to output features given the dic-\ntionary D. We use a soft threshold function that computes each feature $f_j^{(i)}$ as\n$f_j^{(i)} = \\max\\{0, D^{(j)\\top} \\hat{x}^{(i)} - t\\}$ for a \ufb01xed threshold t.\n\nThe computed feature values for each example, f(i), become the new representation for the patch\nx(i). We can now apply the learned feature extractor produced by this method to a larger image, say,\na 32-by-32 pixel RGB color image. This large image can be represented generally as a long vector\nwith 32 \u00d7 32 \u00d7 3 = 3072 elements. To compute its feature representation we simply extract features\nfrom every overlapping patch within the image (using a stride of 1 pixel between patches) and then\nconcatenate all of the features into a single vector, yielding a (usually large) new representation of\nthe entire image.\nClearly we can modify this procedure to use choices of receptive \ufb01elds other than 6-by-6 patches\nof images. Concretely, given the 32-by-32 pixel image, we could break it up into arbitrary choices\nof overlapping sets Rn where each Rn includes a subset of the RGB values of the whole image.\nThen we apply the procedure outlined above to each set of features Rn independently, followed by\nconcatenating all of the extracted features. In general, if X is now any training set (not necessarily\nimage patches), we can de\ufb01ne XRn as the training set X reduced to include only the features in one\nreceptive \ufb01eld, Rn (that is, we simply discard all of the columns of X that do not correspond to\nfeatures in Rn). 
We may then apply the feature learning and extraction methods above to each XRn\nseparately, just as we would for the hand-chosen patch receptive \ufb01elds used in previous work.\n\n3.5 Network Details\n\nThe above components, conceptually, allow us to lump together arbitrary types and quantities of data\ninto our unlabeled training set and then automatically partition them into receptive \ufb01elds in order to\nlearn higher-level features. The automated receptive \ufb01eld selection can choose receptive \ufb01elds that\nspan multiple feature maps, but the receptive \ufb01elds will often span only small spatial areas (since\nfeatures extracted from locations far apart tend to appear nearly independent). Thus, we will also\nexploit spatial knowledge to enable us to use large numbers of maps rather than trying to treat the\nentire input as unstructured data. Note that this is mainly to reduce the expense of feature extraction\nand to allow us to use spatial pooling (which introduces some invariance between layers of features);\nthe receptive \ufb01eld selection method itself can be applied to hundreds of thousands of inputs. We now\ndetail the network structure used for our experiments that incorporates this structure.\nFirst, there is little point in applying the receptive \ufb01eld learning method to the raw pixel layer. Thus,\nwe use 6-by-6 pixel receptive \ufb01elds with a stride (step) of 1 pixel between them for the \ufb01rst layer\nof features. If the \ufb01rst layer contains K1 maps (i.e., K1 \ufb01lters), then a 32-by-32 pixel color image\ntakes on a 27-by-27-by-K1 representation after the \ufb01rst layer of (convolutional) feature extraction.\nSecond, depending on the unsupervised learning module, it can be dif\ufb01cult to learn features that are\ninvariant to image transformations like translation. This is handled traditionally by incorporating\n\u201cpooling\u201d layers [3, 14]. 
Here we use average pooling over adjacent, disjoint 3-by-3 spatial blocks.\nThus, applied to the 27-by-27-by-K1 representation from layer 1, this yields a 9-by-9-by-K1 pooled\nrepresentation.\nAfter extracting the 9-by-9-by-K1 pooled representation from the \ufb01rst two layers, we apply our re-\nceptive \ufb01eld selection method. We could certainly apply the algorithm to the entire high-dimensional\nrepresentation. As explained above, it is useful to retain spatial structure so that we can perform\nspatial pooling and convolutional feature extraction. Thus, rather than applying our algorithm to the\nentire input, we apply the receptive \ufb01eld learning to 2-by-2 spatial regions within the 9-by-9-by-K1\npooled representation. Thus the receptive \ufb01eld learning algorithm must \ufb01nd receptive \ufb01elds to cover\n2 \u00d7 2 \u00d7 K1 inputs. The next layer of feature learning then operates on each receptive \ufb01eld within\nthe 2-by-2 spatial regions separately. This is similar to the structure commonly employed by prior\nwork [4, 12], but here we are able to choose receptive \ufb01elds that span several feature maps in a\ndeliberate way while also exploiting knowledge of the spatial structure.\nIn our experiments we will benchmark our system on image recognition datasets using K1 = 1600\n\ufb01rst layer maps and K2 = 3200 second layer maps learned from N = 32 receptive \ufb01elds. When we\nuse three layers, we apply an additional 2-by-2 average pooling stage to the layer 2 outputs (with\nstride of 1) and then train K3 = 3200 third layer maps (again with N = 32 receptive \ufb01elds). To\nconstruct a \ufb01nal feature representation for classi\ufb01cation, the outputs of the \ufb01rst and second layers of\ntrained features are average-pooled over quadrants as is done by [6]. Thus, our \ufb01rst layer of features\nresult in 1600 \u00d7 4 = 6400 values in the \ufb01nal feature vector, and our second layer of features results\nin 3200\u00d74 = 12800 values. 
When using a third layer, we use average pooling over the entire image\nto yield 3200 additional feature values. The features for all layers are then concatenated into a single\nlong vector and used to train a linear classi\ufb01er (L2-SVM).\n\n4 Experimental Results\n\nWe have applied our method to several benchmark visual recognition problems: the CIFAR-10 and\nSTL datasets. In addition to training on the full CIFAR training set, we also provide results of our\nmethod when we use only 400 training examples per class to compare with other single-layer results\nin [6].\nThe CIFAR-10 examples are all 32-by-32 pixel color images. For the STL dataset, we downsample\nthe (96 pixel) images to 32 pixels. We use the pipeline detailed in Section 3.4, with vector quantiza-\ntion (VQ) as the unsupervised learning module to train up to 3 layers. For each set of experiments\nwe provide test results for 1 to 3 layers of features, where the receptive \ufb01elds for the 2nd and 3rd lay-\ners of features are learned using the method of Section 3.2 and square-correlation for the similarity\nmetric.\nFor comparison, we also provide test results in each case using several alternative receptive \ufb01eld\nchoices. In particular, we have also tested architectures where we use a single receptive \ufb01eld (N = 1)\nwhere R1 contains all of the inputs, and random receptive \ufb01elds (N = 32) where Rn is \ufb01lled accord-\ning to the same algorithm as in Section 3.2, but where the matrix S is set to random values. The \ufb01rst\nmethod corresponds to the \u201ccompletely connected\u201d, brute-force case described in the introduction,\nwhile the second is the \u201crandomly connected\u201d case. Note that in these cases we use the same spatial\norganization outlined in Section 3.5. For instance, the completely-connected layers are connected to\nall the maps within a 2-by-2 spatial window. 
Finally, we will also provide test results using a larger\n1st layer representation (K1 = 4800 maps) to verify that the performance gains we achieve are not\nmerely the result of passing more projections of the data to the supervised classi\ufb01cation stage.\n\n4.1 CIFAR-10\n4.1.1 Learned 2nd-layer Receptive Fields and Features\n\nBefore we look at classi\ufb01cation results, we \ufb01rst inspect the learned features and their receptive \ufb01elds\nfrom the second layer (i.e., the features that take the pooled \ufb01rst-layer responses as their input).\nFigure 1 shows two typical examples of receptive \ufb01elds chosen by our method when using square-\ncorrelation as the similarity metric. In both of the examples, the receptive \ufb01eld incorporates \ufb01lters\nwith similar orientation tuning but varying phase, frequency and, sometimes, varying color. The\nposition of the \ufb01lters within each window indicates its location in the 2-by-2 region considered by\nthe learning algorithm. As we might expect, the \ufb01lters in each group are visibly similar to those\nplaced together by topographic methods like TICA that use related criteria.\n\nFigure 1: Two examples of receptive \ufb01elds chosen from 2-by-2-by-1600 image representations.\nEach box shows the low-level \ufb01lter and its position (ignoring pooling) in the 2-by-2 area considered\nby the algorithm. Only the most strongly dependent features from the T = 200 total features are\nshown. (Best viewed in color.)\n\nFigure 2: Most inhibitory (left) and excitatory (right) \ufb01lters for two 2nd-layer features. (Best viewed\nin color.)\n\nWe also visualize some of the higher-level features constructed by the vector quantization algorithm\nwhen applied to these two receptive \ufb01elds. The \ufb01lters obtained from VQ assign weights to each of\nthe lower level features in the receptive \ufb01eld. 
Those with a high positive weight are \u201cexcitatory\u201d\ninputs (tending to lead to a high response when these input features are active) and those with a\nlarge negative weight are \u201cinhibitory\u201d inputs (tending to result in low \ufb01lter responses). The 5 most\ninhibitory and excitatory inputs for two learned features are shown in Figure 2 (one from each\nreceptive \ufb01eld in Figure 1). For instance, the two most excitatory \ufb01lters of feature (a) tend to select\nfor long, narrow vertical bars, inhibiting responses of wide bars.\n\n4.1.2 Classi\ufb01cation Results\n\nWe have tested our method on the task of image recognition using the CIFAR training and testing\nlabels. Table 1 details our results using the full CIFAR dataset with various settings. We \ufb01rst note\nthe comparison of our 2nd layer results with the alternative of a single large 1st layer using an\nequivalent number of maps (4800) and see that, indeed, our 2nd layer created with learned receptive\n\ufb01elds performs better (81.2% vs. 80.6%). We also see that the random and single receptive \ufb01eld\nchoices work poorly, barely matching the smaller single-layer network. This appears to con\ufb01rm our\nbelief that grouping together similar features is necessary to allow our unsupervised learning module\n(VQ) to identify useful higher-level structure in the data. Finally, with a third layer of features, we\nachieve the best result to date on the full CIFAR dataset with 82.0% accuracy.\n\nTable 1: Results on CIFAR-10 (full)\n\nArchitecture\tAccuracy (%)\n1 Layer\t78.3%\n1 Layer (4800 maps)\t80.6%\n2 Layers (Single RF)\t77.4%\n2 Layers (Random RF)\t77.6%\n2 Layers (Learned RF)\t81.2%\n3 Layers (Learned RF)\t82.0%\nVQ (6000 maps) [6]\t81.5%\nConv. DBN [13]\t78.9%\nDeep NN [4]\t80.49%\n\nTable 2: Results on CIFAR-10 (400 ex. per class)\n\nArchitecture\tAccuracy (%)\n1 Layer\t64.6% (\u00b10.8%)\n1 Layer (4800 maps)\t63.7% (\u00b10.7%)\n2 Layers (Single RF)\t65.8% (\u00b10.3%)\n2 Layers (Random RF)\t65.8% (\u00b10.9%)\n2 Layers (Learned RF)\t69.2% (\u00b10.7%)\n3 Layers (Learned RF)\t70.7% (\u00b10.7%)\nSparse coding (1 layer) [6]\t66.4% (\u00b10.8%)\nVQ (1 layer) [6]\t64.4% (\u00b11.0%)\n\nIt is dif\ufb01cult to assess the strength of feature learning methods on the full CIFAR dataset because\nthe performance may be attributed to the success of the supervised SVM training and not the unsu-\npervised feature training. For this reason we have also performed classi\ufb01cation using 400 labeled\nexamples per class.3 Our results for this scenario are in Table 2. There we see that our 2-layer\narchitecture signi\ufb01cantly outperforms our 1-layer system as well as the two 1-layer architectures de-\nveloped in [6]. As with the full CIFAR dataset, we note that it was not possible to achieve equivalent\nperformance by merely expanding the \ufb01rst layer or by using either of the alternative receptive \ufb01eld\nstructures (which, again, make minimal gains over a single layer).\n\n4.2 STL-10\n\nTable 3: Classi\ufb01cation Results on STL-10\n\nArchitecture\tAccuracy (%)\n1 Layer\t54.5% (\u00b10.8%)\n1 Layer (4800 maps)\t53.8% (\u00b11.6%)\n2 Layers (Single RF)\t55.0% (\u00b10.8%)\n2 Layers (Random RF)\t54.4% (\u00b11.2%)\n2 Layers (Learned RF)\t58.9% (\u00b11.1%)\n3 Layers (Learned RF)\t60.1% (\u00b11.0%)\nSparse coding (1 layer) [6]\t59.0% (\u00b10.8%)\nVQ (1 layer) [6]\t54.9% (\u00b10.4%)\n\nFinally, we also tested our algorithm on the STL-10 dataset [5]. Compared to CIFAR, STL\nprovides many fewer labeled training examples (allowing 100 labeled instances per class for\neach training fold). Instead of relying on labeled data, one tries to learn from the provided\nunlabeled dataset, which contains images from a distribution that is similar to the labeled set\nbut broader. 
We used the same architecture for\nthis dataset as for CIFAR, but rather than train our features each time on the labeled training fold\n(which is too small), we use 20000 examples taken from the unlabeled dataset. Our results are\nreported in Table 3.\n\nHere we see increasing performance with higher levels of features once more, achieving\nstate-of-the-art performance with our 3-layered model. This is especially notable since the higher level features\nhave been trained purely from unlabeled data. We note, one more time, that none of the alternative\narchitectures (which roughly represent common practice for training deep networks) makes\nsignificant gains over the single layer system.\n\n5 Conclusions\n\nWe have proposed a method for selecting local receptive fields in deep networks. Inspired by\nthe grouping behavior of topographic learning methods, our algorithm selects qualitatively similar\ngroups of features directly using arbitrary choices of similarity metric, while also being compatible\nwith any unsupervised learning algorithm we wish to use. For one metric in particular (square\ncorrelation) we have employed our algorithm to choose receptive fields within multi-layered networks\nthat lead to successful image representations for classification while still using only vector\nquantization for unsupervised learning, a relatively simple but highly scalable learning module. Among\nour results, we have achieved the best published accuracy on CIFAR-10 and STL datasets. These\nresults are strengthened by the fact that they did not require the use of any supervised\nbackpropagation algorithms. 
We expect that the method proposed here is a useful new tool for managing\nextremely large, higher-level feature representations where more traditional spatio-temporal local\nreceptive fields are unhelpful or impossible to employ successfully.\n\n3 Our networks are still trained unsupervised from the entire training set.\n\nReferences\n\n[1] R. Adams, H. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In International Conference on AI and Statistics, 2010.\n\n[2] A. Bell and T. J. Sejnowski. The \u2018independent components\u2019 of natural scenes are edge filters. Vision Research, 37, 1997.\n\n[3] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Computer Vision and Pattern Recognition, 2010.\n\n[4] D. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. Pre-print, 2011. http://arxiv.org/abs/1102.0183.\n\n[5] A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In International Conference on AI and Statistics, 2011.\n\n[6] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In International Conference on Machine Learning, 2011.\n\n[7] P. Garrigues and B. Olshausen. Group sparse coding with a laplacian scale mixture prior. In Advances in Neural Information Processing Systems, 2010.\n\n[8] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields, 2010.\n\n[9] F. Huang and Y. LeCun. Large-scale learning with SVM and convolutional nets for generic object categorization. In Computer Vision and Pattern Recognition, 2006.\n\n[10] A. Hyvarinen, P. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computation, 13(7):1527\u20131558, 2001.\n\n[11] A. Hyvarinen and E. Oja. 
Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411\u2013430, 2000.\n\n[12] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In International Conference on Computer Vision, 2009.\n\n[13] A. Krizhevsky. Convolutional Deep Belief Networks on CIFAR-10. Unpublished manuscript, 2010.\n\n[14] Y. LeCun, F. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004.\n\n[15] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, 2009.\n\n[16] V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In International Conference on Machine Learning, 2010.\n\n[17] N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 2009.\n\n[18] A. Saxe, P. Koh, Z. Chen, M. Bhand, B. Suresh, and A. Y. Ng. On random weights and unsupervised feature learning. In International Conference on Machine Learning, 2011.\n\n[19] D. Scherer, A. M\u00fcller, and S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In International Conference on Artificial Neural Networks, 2010.\n\n[20] E. Simoncelli and O. Schwartz. Modeling surround suppression in V1 neurons with a statistically derived normalization model. Advances in Neural Information Processing Systems, 1998.\n\n[21] K. Zhang and L. Chan. ICA with sparse connections. 
Intelligent Data Engineering and Automated Learning, 2006.", "award": [], "sourceid": 1368, "authors": [{"given_name": "Adam", "family_name": "Coates", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}