{"title": "Intrinsic dimension of data representations in deep neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6111, "page_last": 6122, "abstract": "Deep neural networks progressively transform their inputs across multiple processing layers. What are the geometrical properties of the representations learned by these networks? Here we study the intrinsic dimensionality (ID) of data\nrepresentations, i.e. the minimal number of parameters needed to describe a representation. We find that, in a trained network, the ID is orders of magnitude smaller than the number of units in each layer. Across layers, the ID first increases and then progressively decreases in the final layers. Remarkably, the ID of the last hidden layer predicts classification accuracy on the test set. These results can neither be found by linear dimensionality estimates (e.g., with principal component analysis), nor in representations that had been artificially linearized. They are neither found in untrained networks, nor in networks that are trained on randomized labels. This suggests that neural networks that can generalize are those that transform the data into low-dimensional, but not necessarily flat manifolds.", "full_text": "Intrinsic dimension of data representations in deep\n\nneural networks\n\nInternational School for Advanced Studies\n\nInternational School for Advanced Studies\n\nTechnical University of Munich\n\nInternational School for Advanced Studies\n\nAlessio Ansuini\n\nalessioansuini@gmail.com\n\nJakob H. Macke\n\nmacke@tum.de\n\nAlessandro Laio\n\nlaio@sissa.it\n\nDavide Zoccolan\n\nzoccolan@sissa.it\n\nAbstract\n\nDeep neural networks progressively transform their inputs across multiple pro-\ncessing layers. What are the geometrical properties of the representations learned\nby these networks? Here we study the intrinsic dimensionality (ID) of data-\nrepresentations, i.e. 
the minimal number of parameters needed to describe a representation. We find that, in a trained network, the ID is orders of magnitude smaller than the number of units in each layer. Across layers, the ID first increases and then progressively decreases in the final layers. Remarkably, the ID of the last hidden layer predicts classification accuracy on the test set. These results can neither be found by linear dimensionality estimates (e.g., with principal component analysis), nor in representations that had been artificially linearized. They are neither found in untrained networks, nor in networks that are trained on randomized labels. This suggests that neural networks that can generalize are those that transform the data into low-dimensional, but not necessarily flat manifolds.

1 Introduction

Deep neural networks (DNNs), including convolutional neural networks (CNNs) for image data, are among the most powerful tools for supervised data classification. In DNNs, inputs are sequentially processed across multiple layers, each performing a nonlinear transformation from a high-dimensional vector to another high-dimensional vector. Despite the empirical success and widespread use of DNNs, we still have an incomplete understanding of why and when they work so well – in particular, it is not yet clear why they are able to generalize well to unseen data, notwithstanding their massive overparametrization (1). While progress has been made recently [e.g. (2; 3)], guidelines for selecting architectures and training procedures are still largely based on heuristics and domain knowledge.
A fundamental geometric property of a data representation in a neural network is its intrinsic dimension (ID), i.e., the minimal number of coordinates which are necessary to describe its points without significant information loss.
It is widely appreciated that deep neural networks are overparametrized, and that there is substantial redundancy amongst the weights and activations of deep nets – e.g., several studies in network compression have shown that many weights in deep neural networks can be pruned without significant loss in classification performance (4; 5). Linear estimates of the ID in DNNs have been computed theoretically and numerically in simplified models (6), and local estimates of the ID developed in (7) have been related to robustness properties of deep networks to adversarial attacks (8; 9), showing that a low local intrinsic dimension correlates positively with robustness. Local ID of object manifolds can also be estimated at several locations on the tangent space, and was found to decrease along the last hidden layers of AlexNet (10; 11).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Both linear and nonlinear dimensionality reduction techniques have been used extensively to visualize computations in deep networks (12; 13; 14). Local ID in the last hidden layer of deep networks has been found to be much lower than the dimension of the embedding space (15), consistent with the results of (16), in which estimates of the ID are used to signal the onset of overfitting in the presence of noisy labels, thus showing a connection between intrinsic dimension and generalization during the training of specific models.
However, there has not been a direct and systematic characterization of how the intrinsic dimension of data manifolds varies across the layers of CNNs and how it relates to generalization across a wide variety of architectures. We here leverage TwoNN (17), a recently developed estimator for global ID that exploits the fact that nearest-neighbour statistics depend on the ID (18) (see Fig. 1 for an illustration).
TwoNN can be applied even if the manifold containing the data is curved, topologically complex, and sampled non-uniformly. This procedure is not only accurate, but also computationally efficient. In a few seconds on a desktop PC it provides the estimate of the ID of a data set with O(10^4) data points, each with O(10^5) coordinates (for example the activations in an intermediate layer of a CNN), thus making it possible to map out ID across multiple layers and networks. Using this estimator, we investigated the variation of the ID along the layers of a wide range of deep neural networks trained for image classification. Specifically, we addressed the following questions:

• How does the ID change along the layers of CNNs? Do CNNs compress representations into low-dimensional manifolds, or, conversely, seek to expand the dimensionality?
• How different is the actual ID from the 'linear' dimensionality of a network, i.e., the dimensionality of the linear subspace containing the data-manifold? A substantial mismatch would indicate that the underlying manifolds are curved rather than flat.
• How is the ID of a network related to its generalization performance? Can we find empirical signatures of generalization performance in the geometrical structure of the representations?

Our analyses show that data representations in CNNs are embedded in manifolds of low dimensionality, which is typically several orders of magnitude lower than the dimensionality of the embedding space (the number of units in a layer).
In addition, we found that the variation of the ID along the hidden layers of CNNs follows a similar trend across different architectures – the early layers expand the dimensionality of the representations, followed by a monotonic decrease that brings the ID to reach low values in the final layers.
Moreover, we observed that, in networks trained to classify images, the ID of the training set in the last hidden layer is an accurate predictor of the network's classification accuracy on the test set – i.e., the lower the ID in this layer, the better the network capability of correctly classifying the image categories in a test set. Conversely, in the last hidden layer, the ID remains high for a network trained on non-predictable data (i.e., with permuted labels), on which the network is forced to memorize rather than generalize. These geometrical properties of representations in trained neural networks were empirically conserved across multiple architectures, and might point to an operating principle of deep neural networks.

Figure 1: The TwoNN estimator derives an estimate of intrinsic dimensionality from the statistics of nearest-neighbour distances. [Panels illustrate the procedure: 1) for each data point i, compute the distances to its first and second neighbours (r_i,1 and r_i,2); 2) for each i, compute µ_i = r_i,2/r_i,1, whose probability distribution depends only on the ID d, independently of the local density of points; 3) infer d from the empirical probability distribution of all the µ_i; 4) repeat the calculation selecting a fraction of points at random, which gives the ID as a function of the scale.]

2 Estimating the intrinsic dimension of data representations

Inferring the intrinsic dimension of high-dimensional and sparsely sampled data representations is a challenging statistical problem. To estimate the ID of data-representations in deep networks, we leverage a recently developed global ID-estimator ('TwoNN') that is based on computing the ratio between the distances to the second and first nearest neighbors (NN) of each data point (17) (see Fig. 1). This allows overcoming the problems related to the curvature of the embedding manifold and to the local variations in the density of the data points, under the weak assumption that the density is constant on the scale of the distance between each point and its second nearest neighbor.
Formally, let points x_i be uniformly sampled on a manifold with intrinsic dimension d and let N be the total number of points. Let r_i^(1) and r_i^(2) be the distances of the first and second neighbor of i, respectively. Then µ_i := r_i^(2)/r_i^(1), i = 1, 2, ..., N, follows a Pareto distribution with parameter d + 1 on [1, +∞), f(µ_i|d) = d µ_i^−(d+1). Taking advantage of this observation, we can formulate the likelihood of the vector µ := (µ_1, µ_2, ..., µ_N) as

P(µ|d) = d^N ∏_{i=1}^{N} µ_i^−(d+1).   (1)

At this point d can be easily computed, for instance by maximizing the likelihood, or, following (17), by employing the empirical cumulate of the distribution of the µ values to reduce the ID estimation task to a linear regression problem. Indeed, the ID can also be estimated by restricting the product in eq. 1 to non-intersecting triplets of points, for which independence is strictly satisfied, but, as shown in ref. (17), in practice this does not significantly affect the estimate.
The ID estimated by this approach is asymptotically correct even for samples harvested from highly non-uniform probability distributions.
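The estimator described above can be sketched in a few lines. The following is a minimal illustration of the linear-regression variant (fitting −log(1 − F(µ)) = d·log µ through the origin, which follows from the Pareto cumulate F(µ) = 1 − µ^−d), not the authors' released implementation; the choice of discarding the largest 10% of the ratios is a common convention for this estimator and is an assumption here.

```python
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X, keep_fraction=0.9):
    """Minimal TwoNN sketch: estimate the intrinsic dimension d from the
    ratios mu_i = r_i^(2) / r_i^(1) of second- to first-nearest-neighbour
    distances. Under the Pareto model F(mu) = 1 - mu^(-d), fitting
    -log(1 - F(mu)) = d * log(mu) through the origin recovers d."""
    N = X.shape[0]
    # k=3 because the closest "neighbour" of each query point is itself
    dists, _ = cKDTree(X).query(X, k=3)
    mu = np.sort(dists[:, 2] / dists[:, 1])
    n = int(keep_fraction * N)                   # drop the largest ratios
    x = np.log(mu[:n])
    y = -np.log(1.0 - np.arange(1, n + 1) / N)   # empirical cumulate
    return float(np.dot(x, y) / np.dot(x, x))    # least-squares slope
```

Re-running the estimate on random subsets of X (the decimation test of Fig. 2B) probes whether the result is stable across scales.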
For a finite number of data points, the estimated values remain very close to the ground truth ID, when this is smaller than ∼ 20. For larger IDs and finite sample size, the approach moderately underestimates the correct value, especially if the density of data is non-uniform. Therefore, the values reported in the following figures, when larger than ∼ 20, should be considered as lower bounds.
For real-world data, the intrinsic dimension always depends on the scale of distances on which the analysis is performed. This implies that the reliability of the dimensionality estimate needs to be assessed by measuring the intrinsic dimension at different scales and by checking whether it is, at least approximately, scale invariant (17). In our analyses, this test was performed by systematically decimating the dataset, thus gradually reducing its size. The ID was then estimated on the reduced samples, in which the average distance between data points had become progressively larger. This allowed estimating the dependence of the ID on the scale. As explained in (17), if the ID is well-defined, its estimated value will only depend weakly on the number of data points N; in particular, it will not be severely affected by the presence of "hubs", since the decimation procedure would kill them (see Fig. 2B).
To test the reliability of our ID estimator on embedding spaces with a dimension comparable to that found in the layers of a deep network, we performed tests on artificial data of known ID, embedded in a 100,000-dimensional space. The test did not reveal any significant degradation of the accuracy. Indeed, the ID estimator is sensitive only to the value of the distances between pairs of points, and these distances do not depend on the embedding dimension.
For computational efficiency, we analyzed the representations of a subset of layers.
We extracted representations at pooling layers after a convolution or a block of consecutive convolutions, and at fully connected layers. In the experiments with ResNets, we extracted the representations after each ResNet block (19) and the average pooling before the output. See A.1 for details.
The code to compute the ID estimates with the TwoNN method and to reproduce our experiments is available at this repository.

3 Results

3.1 The intrinsic dimension exhibits a characteristic shape across several networks

Our first goal was to empirically characterize the ID of data representations in different layers of deep neural networks. Given a layer l of a DNN, an individual data point (e.g., an image) is mapped onto the set of activations of all the n_l units of the layer, which define a point in an n_l-dimensional space. We refer to n_l as the embedding dimension (ED) of the representation in layer l. A set of N input

Figure 2: Modulation of ID across hidden layers of deep convolutional networks. A) ID across layers of VGG-16-R; error bars are the standard deviation of the ID (see A.1). Numbers in the plot indicate the embedding dimensionality of each layer. B) Subsampling analysis on the VGG-16-R experiment, reported for the same layers as in the inset in A (see A.1 for details).

samples (e.g., N images) generate, in each layer l, a set of N n_l-dimensional points. We estimated the dimension of the manifold containing these points using TwoNN.
We first investigated the variation of the ID across the layers of a VGG-16 network (20), pre-trained on ImageNet (11), and fine-tuned and evaluated on a synthetic data-set of 1440 images (21). The dataset consisted of 40 3D objects, each rendered in 36 different views (we left out 6 images for each object as a test set) – it thus spanned a spectrum of different appearances, but of a small number of underlying geometrical objects.
When estimating the ID of data representations on this network (referred to as 'VGG-16-R'), we found that the ID first increased in the first pooling layer, before decreasing monotonically across the following layers, reaching very low values in the final hidden layers (Fig. 2A). For instance, in the fourth pooling layer (pool4) of VGG-16-R, ID ≈ 19 and ED ≈ 10^5, with ID/ED ≈ 2 × 10^−4, which is consistent with the values reported by (15) using a different ID estimator (22).
One potential concern is whether the number of stimuli is sufficient for the ID-estimate to be robust. To investigate this, we repeated the analysis on subsamples randomly chosen on the data manifold, finding that the estimated IDs were indeed stable across a wide range of sample sizes (Fig. 2B). We note that, for the early/intermediate layers, the reported values of the ID are likely a lower bound to the real ID (see discussion in (17)).
Are the 'hunchback' shape of the ID variation across the layers (i.e., the initial steep increase followed by a gradual monotonic decrease), and the overall low values of the ID, specific to this particular network architecture and dataset? To investigate this question, we repeated these analyses on several standard architectures (AlexNet, VGG and ResNet) pre-trained on ImageNet (23). Specifically, we computed the average ID of the object manifolds corresponding to the 7 biggest ImageNet categories, using 500 images per category (see section A.1). We found both the hunchback shape and the low IDs to be preserved across all networks (Fig. 3A): the ID initially grew, then reached a peak or a plateau and, finally, progressively decreased towards its final value. As shown in Fig.
8 for AlexNet, such profile of ID variation across layers was generally consistent across object classes.
The ID in the output layer was the smallest, often assuming a value of the order of ten. Such a low value is to be expected, given that the ID of the output layer of a network capable of recognizing N_c categories is bound by the condition N_c ≤ 2^ID, which implies that each category is associated with a binary representation, and that the output layer optimally encodes this representation. For the ∼ 1000 categories of ImageNet, this bound becomes ID ≳ 10, a value consistent with those observed in all the networks we considered.
Is the relative (rather than the absolute) depth of a layer indicative of the ID? To investigate this, we plotted ID against relative depth (defined as the absolute depth of the layer divided by the total number of layers, not counting batch normalization layers (12)) of the 14 models belonging to the three classes of networks (Fig. 3B). Remarkably, the ID profiles approximately collapsed onto a

Figure 3: ID of object manifolds across networks. A) IDs of data representations for 4 networks: each point is the average of the IDs of 7 object manifolds. The error bars are the standard deviations of the ID across the single object's estimates (see A.1). B) The ID as a function of the relative depth in 14 deep convolutional networks spanning different sizes, architectures and training techniques. Despite the wide diversity of these models, the ID profile follows a typical hunchback shape (error bars not shown).

common hunchback shape¹, despite considerable variations in the architecture, number of layers, and optimization algorithms.
For networks belonging to the VGG and ResNet families, the rising portions of the ID profiles substantially overlapped, with the ID reaching similar large peak values (between 100 and 120) in the relative depth range 0.2–0.4. The dependence on relative depth is consistent with the results of (12), where it was observed that similarity between layers depended on relative depth. Notably, in all networks the ID eventually converged to small values in the last hidden layer. These results suggest that state-of-the-art deep neural networks – after an initial increase in ID – perform a progressive dimensionality reduction of the input feature vectors, a result which is consistent with the information-theoretical analysis in (24). Based on previous findings about the evolution of the ID in the last hidden layer during training (16), one could speculate that this progressive, gradual reduction of dimensionality of data-manifolds is a feature of deep neural networks which allows them to generalize well. In the following, we will investigate this idea further by showing that the ID of the last hidden layers predicts generalization performance, and by showing that these properties cannot be found in networks with random weights or trained on non-predictable data.

3.2 The intrinsic dimension of the last hidden layer predicts classification performance

Although the hunchback shape was preserved across networks, the IDs in the last hidden layers were not exactly the same for all the networks. To better resolve such differences, we computed the ID in the last hidden layer of each network using a much larger pool of images of the training set (∼ 2,000), sampled from all ImageNet categories (see section A.1). This revealed a spread of ID values, ranging between ≈ 12 (for ResNet152) and ≈ 25 (for AlexNet, Fig. 4).
These differences may appear small, compared to the much larger size of the embedding space in the last hidden layer (where the ED was between 1 and 2 orders of magnitude larger than the ID; ED range = [512, 4096]). However, the ID in the last hidden layer on the training set was indeed a strong predictor of the performance of the network on the test set, as measured by the top 5-score (Fig. 4, Pearson correlation coefficient r = 0.94). A tight correlation was found not only across the full set of networks, but also within each class of architectures, when such comparison was possible – i.e., in the classes of the VGG with and without batch normalization and ResNets (r = 0.99 in the latter case, see inset in Fig. 4).

¹ With the exception of AlexNet, and a small network trained on MNIST in a separate analysis; see section 3.4 for details and analysis.

Overall, this analysis suggests that the ID in the last hidden layer can be used as a proxy for the generalization ability of a network. Importantly, this proxy can be measured without estimating the performance on an external validation set.

3.3 Data representations lie on curved manifolds

Figure 4: ID of the last hidden layer predicts performance. The ID of data representations (training set) predicts the top 5-score performance on the test set. Inset: Detail for the ResNet class.

The strength of the TwoNN method lies in its ability to infer the ID of data representations, even if they lie on curved manifolds.
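Concretely, the correlation between last-hidden-layer ID and test error can be computed as below. The numeric pairs here are made-up placeholders for illustration only (the text reports only the endpoints ≈ 12 and ≈ 25 and the overall r = 0.94); the measured values are those plotted in Fig. 4.

```python
import numpy as np

# Hypothetical (ID, top-5 error %) pairs, one per network -- placeholder
# values for illustration only; the measured ones are shown in Fig. 4.
last_hidden_id = np.array([25.0, 19.0, 18.0, 17.0, 14.0, 13.0, 12.0])
top5_error = np.array([20.0, 11.0, 10.0, 9.5, 8.0, 7.5, 6.5])

# Pearson correlation between ID on the training set and test-set error
r = np.corrcoef(last_hidden_id, top5_error)[0, 1]
```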
This raises the question of whether our observations (low IDs, hunchback shapes, correlation with test-error) reflect the fact that data points live on low-dimensional, yet highly curved manifolds, or, simply, in low-dimensional, but largely flat (linear) subspaces.
To test this, we performed linear dimensionality reduction (principal component analysis, PCA) on the normalized covariance matrix (i.e., the matrix of correlation coefficients – using the raw covariance resulted in qualitatively similar results) for each layer and network. We did not find a clear gap in the eigenvalue spectrum (Fig. 5A), a result that is qualitatively consistent with that obtained for stimulus-representations in primary visual cortex (25). The absence of a gap in the spectrum, with the magnitude of the eigenvalues smoothly decreasing as a function of their rank, is, by itself, an indication that the data manifolds are not linear. Nevertheless, we defined an 'ad-hoc' estimate of dimensionality by computing the number of components that should be included to describe 90% of the variance in the data. In what follows, we call this number PC-ID. We found PC-ID to be about one or two orders of magnitude larger than the value of the ID computed with TwoNN. For example, the PC-ID in the last hidden layer of VGG-16 was ≈ 200 (Fig. 5C, solid red line), while the ID estimated with TwoNN was ≈ 18 (solid black line).
The discrepancy between the ID estimated with TwoNN and with PCA points to the existence of strong non-linearities in the correlations between the data, which are not captured by the covariance matrix. To verify that this was indeed the case (and, e.g., not a consequence of estimation bias), we used TwoNN to compare the ID of the last hidden layer of VGG-16 with the ID of a synthetic Gaussian dataset with the same second-order moments.
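The two linear baselines used in this comparison can be sketched as follows, assuming activations are arranged as an (n_samples, n_units) array. This is an illustration under those assumptions, not the exact analysis code; function names are ours.

```python
import numpy as np

def pc_id(X, variance_fraction=0.9):
    """Linear 'PC-ID': number of principal components needed to explain
    a given fraction of the variance, computed on the correlation matrix
    (i.e. after standardizing each unit; assumes no zero-variance units)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xs.T))[::-1]  # descending order
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, variance_fraction) + 1)

def gaussian_surrogate(X, rng):
    """Synthetic Gaussian dataset with the same mean and covariance as X,
    but with all higher-order (nonlinear) structure destroyed."""
    return rng.multivariate_normal(X.mean(axis=0), np.cov(X.T), size=len(X))
```

On a curved manifold the linear and nonlinear notions diverge: for points sampled on a circle (true ID 1), `pc_id` returns 2, because PCA can only count flat directions.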
The ID of the original dataset was low and stable as a function of the size N of the data sample used to estimate it (Fig. 5B, black curve; similar subsampling analysis as previously shown in Fig. 2B). In contrast, the ID of the synthetic dataset was two orders of magnitude larger, and grew with N (Fig. 5B, red curve), as expected in the case of an ill-defined estimator (17).
We also computed the PC-ID of the object manifolds across the layers of VGG-16 on randomly initialized networks, and we found that its profile was qualitatively the same as in trained networks (compare solid and dashed red curves in Fig. 5C). By contrast, when the same comparison was performed on the ID (as computed using TwoNN), the trends obtained on random weights (dashed black curve) and after training the network (solid black curve) were very different. While the latter showed the hunchback profile (same as in Fig. 3), the former was remarkably flat. This behaviour can be explained by observing that the ID of the input is very low (see section 3.4 for a discussion of this point). For random weights, each layer effectively performs an orthogonal transformation, thus preserving such low ID across layers. Importantly, the hunchback profile observed for the ID in trained networks (Figs 2A, 3A,B) is a genuine result of training, which does not merely reflect the initial expansion of the ED from the input to the first hidden layers, as shown by the fact that, in VGG-16, the ID kept growing after the ED had already started to substantially decline (compare the solid black and blue curves in Fig. 5C).

The analysis shown in Fig.
5C also indicates that intermediate layers and the last hidden layer undergo opposite trends as the result of training (compare the solid and dashed black curves): while the ID of the last hidden layer is reduced with respect to its initial value [consistent with what was reported in (16)], the ID of intermediate layers increases by a large amount. This prompted us to run an exploratory analysis to monitor the ID evolution during training in a VGG-16 network trained with CIFAR-10. We observed a behavior that was consistent with that already reported in Fig. 5C: a substantial increase of the ID in the initial/intermediate layers, and a decrease in the last layers (Fig. 9A, black vs. orange curve). Interestingly, a closer inspection of the dynamics in the last hidden layer revealed a non-monotonic variation of the ID (see Fig. 9B,C). Here, after an initial drop, the ID slowly increased, but, differently from (16), without resulting in substantial overfitting. Thus, the evolution of the ID during learning appears to be not strictly monotonic, and its trend likely depends on the specific architecture and dataset, calling for further investigation.

Figure 5: Evidence that data-representations are on curved manifolds. A) Variance spectra of the last hidden layer do not show a clear gap. B) ID in the last hidden layer of VGG-16 (black), compared with the ID of a synthetic Gaussian dataset with the same size and second-order correlation structure (red). C) The ID and the PC-ID along the layers of VGG-16 for a trained network and an untrained, randomly initialized network. The ED, rescaled to reach the maximum at 400, is shown in blue.

3.4 The initial increase in intrinsic dimension can arise from irrelevant features

We generally found the ID to increase in the initial layers. However, this was not observed for a small network trained on the MNIST data-set (Fig. 6B, black curve) and was also less pronounced for AlexNet (Fig. 3A, orange curve).
A mechanism underlying the initial ID rise could be the fact that the input is dominated by features that are irrelevant for predicting the output, but are highly correlated with each other. To validate this hypothesis, we generated a modified MNIST dataset (referred to as MNIST*) by adding a luminance perturbation that was constant for all pixels within an image, but random across the various images (Fig. 6A). Given an image i with pixel values x_i ∈ R^N (where N is the number of pixels), we added shared random perturbations, x_i → x*_i = x_i + λξ_i, where λ is a positive parameter and the ξ_i are i.i.d. uniformly distributed random variables in the range [0, 1]. This perturbation has the effect of stretching the dataset along a specific direction in the input space (the vector [1, 1, . . . , 1]), thus reducing the ID of the data manifold in the input layer. Indeed, with λ = 100, the ID of the input representation dropped from ≈ 13 (its original value) to ≈ 3.
The network trained on MNIST* was still able to generalize (accuracy ≈ 98%). However, the variation of the ID (blue curve in Fig. 6B) now showed a hunchback shape reminiscent of that already observed in Figs 2A and 3A,B for large architectures. This suggests that the growth of the ID in the first hidden layers of a deep network is determined by the presence in the input data of low-level features that carry no information about the correct labeling – for instance, in the case of images, gradients of luminance or contrast. One can speculate that, in a trained deep network, the first layers prune the irrelevant features, formatting the representation for the more advanced processing carried out by the last layers (24). The initial increase of the dimensionality of the data manifold could be the signature of such pruning.
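The MNIST*-style perturbation can be sketched as follows. This is a minimal illustration of the construction described above; the function and variable names are ours, not from the paper's code.

```python
import numpy as np

def add_luminance_shift(X, lam=100.0, rng=None):
    """Add a luminance offset that is constant across the pixels of an
    image but random across images, stretching the dataset along the
    direction [1, 1, ..., 1] in pixel space and lowering its input ID.
    X: (n_images, n_pixels) array of flattened images."""
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.uniform(0.0, 1.0, size=(X.shape[0], 1))  # one offset per image
    return X + lam * xi                               # broadcast over pixels
```

With lam = 100, the shared offset dominates the per-pixel variance, so most of the input variance concentrates along [1, ..., 1]; this is the stretching that lowered the input-layer ID from ≈ 13 to ≈ 3 in the experiment above.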
This notion is consistent with recent evidence gathered in the field of visual neuroscience, where the pruning of low-level confounding features, such as luminance, has been demonstrated along the progression of visual cortical areas that, in the rat brain, are thought to support shape processing and object recognition (26).

Figure 6: A) The addition of a luminance gradient across the images of the MNIST dataset results in a stretching of the image manifold along a straight line in the input space of the pixel representation. B) Change of the ID along all the layers of the MNIST network, as obtained in three different experiments: 1) with the original MNIST dataset (black curve); 2) with the luminance-perturbed MNIST* dataset (blue curve); and 3) with MNIST†, in which the labels of the MNIST images were randomly shuffled (red curve).

3.5 A network trained on random labels does not show the characteristic hunchback profile of ID variation

In untrained networks the ID profile is largely flat (Fig. 5C). Are there other circumstances in which the ID profile deviates from the typical hunchback shape of Figs 2A and 3A,B, with IDs that do not decrease progressively towards the output? It turns out that this is the case when generalization is impossible by construction, as we verified by randomly shuffling the labels on MNIST (we refer to the shuffled data as MNIST†).
It has been shown (1) that deep networks can perfectly fit the training set on randomly labelled data, while necessarily achieving chance level performance on the test set.
As a result, when we trained the same network as in section 3.4 on MNIST†, we achieved a training error of zero. However, we found that the network had an ID profile which did not decrease monotonically (orange curve in Fig. 6B) – in contrast to the same network trained with the original dataset (black curve). Instead, it grew considerably in the second half of the network, almost saturating the upper bound, which is set by the ED, in the output layer. This suggests that the reduction of the dimensionality of data manifolds in the last layers of a trained network reflects the process of learning on a generalizable dataset. By contrast, overfitting noisy labels leads to an expansion of the dimensionality, as already reported in (16). As suggested in that study, this indicates that a network trained on inconsistent data can be recognized without estimating its performance on a test set, but by simply looking at whether the ID increases substantially across its final layers.

4 Conclusions and Discussion

Convolutional neural networks, as well as their biological counterparts, such as the visual system of primates (27) and other species (26; 28), transform the input images across a progression of processing stages, eventually providing an explicit (i.e. transformation-tolerant) representation of visual objects in the output layer. Leading theories in the field of visual neuroscience postulate that such reformatting gradually untangles and flattens the manifolds produced by the different images within the representational space defined by the activity of all the neurons (or units) in a layer (27; 29; 30). This suggests that the dimensionality of the object manifolds may progressively decrease along the layers of a deep network, and that such a decrease may be at the root of the high classification accuracy achieved by deep networks.
Although previous theoretical and empirical studies have provided support to this hypothesis using small/simple network architectures or focusing on single layers of large networks (6; 10; 16; 31; 32; 33), our study is the first to investigate systematically how the dimensionality of individual object manifolds – or mixtures of object manifolds – varies in large, state-of-the-art CNNs used for image classification.
Our results can be summarized by making reference to the cartoon shown in Fig. 7. We found that the ID in the initial layer of a network is low. As shown in Fig. 6, this can be explained by the existence of gradients of correlated low-level visual features (e.g., luminance, contrast, etc.) across the image set (34), resulting in a stretching of image representations along a few main variation axes within the input space (see Fig. 7A). Early layers of DNNs appear to get rid of these correlations, which are irrelevant for the classification task, thus leading to an increase of the ID of the object manifolds (Figs 2A and 3A,B). As illustrated in Fig. 7B, this can be thought of as a sort of whitening of the input data. Such an initial dimensionality expansion is also thought to be performed in the visual system (30; 34), and is consistent with a recent characterization of the dimensionality of image representations in primary visual cortex (25) and with the pruning of low-level information performed by high-order visual cortical areas (26).
After this initial expansion, the representation is squeezed into manifolds of progressively lower ID (Figs 2 and 3A,B), as graphically illustrated in Fig. 7C,D. This phenomenon has already been observed by (31) and (6) on simplified datasets and architectures, by (10) in the final, fully connected layers of AlexNet, and by (16) in the last hidden layers of two different DNNs, where the ID evolution was tracked during training.
We here demonstrate that this progressive reduction of the dimension of data manifolds is a general behavior and a key signature of every CNN we tested – both small toy models (Fig. 6B) and large state-of-the-art networks (Fig. 3A,B). More importantly, our experiments show that the extent to which a deep network is able to compress the dimensionality of data representations in the last hidden layer is a key predictor of its ability to generalize well to unseen data (Fig. 4) – a finding that is consistent with the inverse relationship between ID and accuracy reported by (16), although our pilot tests suggest that the ID, after a large, initial drop, can slightly increase during training without producing overfitting (Fig. 9). From a theoretical standpoint, this result is broadly consistent with recent studies linking the classification capacity of data manifolds by perceptrons to their geometrical properties (32; 35). Our findings also resonate with the compression of the information about the input data during the final phase of training of deep networks (36), which is progressively larger as a function of the layer's depth, thus displaying a trend that is reminiscent of the one observed for the ID in our study.
Finally, our experiments also show that the ID values are lower than those identified using PCA, or on 'linearized' data, which is an indication that the data lies on curved manifolds. In addition, ID measures from PCA did not qualitatively distinguish between trained and randomly initialized networks (Fig. 5C). This conclusion is at odds with the unfolding of data manifolds reported by (33) across the layers of a small network tested with simple datasets.
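The gap between nonlinear and PCA-based dimensionality estimates on curved manifolds can be illustrated with a small sketch. Below is an illustrative implementation (not the exact code used in this study) of the TwoNN estimator of (17) in its maximum-likelihood form, d̂ = N / Σ_i log μ_i, with μ_i the ratio of second- to first-nearest-neighbor distances, alongside a linear "PC-ID" (number of principal components retaining 90% of the variance), applied to a synthetic two-dimensional "swiss roll" embedded in three dimensions; the manifold size and point count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate (maximum-likelihood form):
    d_hat = N / sum(log(r2 / r1)), with r1, r2 the distances of each
    point to its first and second nearest neighbors."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)
    nn = np.sort(d2, axis=1)[:, :2]      # squared r1, r2 for each point
    mu = np.sqrt(nn[:, 1] / nn[:, 0])    # second-to-first NN distance ratio
    return len(X) / np.log(mu).sum()

def pca_id(X, variance=0.9):
    """Linear 'PC-ID': number of principal components needed to
    retain the given fraction of the total variance."""
    eigs = np.linalg.eigvalsh(np.cov(X.T))[::-1]   # descending eigenvalues
    frac = np.cumsum(eigs) / eigs.sum()
    return int(np.searchsorted(frac, variance) + 1)

# A 2D manifold curved inside 3D space (a "swiss roll").
n = 2000
t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)
h = rng.uniform(0.0, 20.0, n)
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])

id_nl = twonn_id(X)   # close to 2, the true manifold dimension
id_lin = pca_id(X)    # 3: PCA sees the full embedding dimension
print(id_nl, id_lin)
```

Because TwoNN only uses the ratio of the two shortest neighbor distances, it probes the manifold at a scale where curvature is negligible, whereas PCA must account for the variance the curvature spreads across all embedding directions.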
It also suggests a slight twist on theories about transformations in the visual system (27; 29) – it indicates that a flattening of data manifolds may not be a general computational goal that deep networks strive to achieve: progressive reduction of the ID, rather than gradual flattening, seems to be the key to achieving linearly separable representations.
To conclude, we hope that data-driven, empirical approaches to investigate deep neural networks, like the one implemented in our study, will provide intuitions and constraints, which will ultimately inspire and enable the development of theoretical explanations of their computational capabilities.

Figure 7: A. Input layer. The intrinsic dimensionality of the data can assume low values due to the presence of irrelevant features uncorrelated with the ground truth. B. The first hidden layers pre-process the data, raising its intrinsic dimension. C,D. The representation is squeezed onto manifolds of progressively lower intrinsic dimension. These manifolds are typically not hyperplanes. D. In the last hidden layer, the ID shows a remarkable correlation with the performance of trained networks. E. The output layer.

Acknowledgments

We thank Eis Annavini for providing the custom dataset described in A.1.1; Artur Speiser and Jan-Matthis Lückmann for a careful proofreading of the manuscript. We warmly thank Elena Facco for her valuable help in the early phases of this project. We also thank Naftali Tishby, Riccardo Zecchina, Matteo Marsili, Tim Kietzmann, Florent Krzakala, Lenka Zdeborova, Fabio Anselmi, Luca Bortolussi, Jim DiCarlo and SueYeon Chung for useful discussions and suggestions, and the anonymous referees for their useful and constructive comments.
This work was supported by a European Research Council (ERC) Consolidator Grant, 616803-LEARN2SEE (D.Z.).
JHM is funded by the German Research Foundation (DFG) through SFB 1233 (276693517), SFB 1089 and SPP 2041, the German Federal Ministry of Education and Research (BMBF, project 'ADMIMEM', FKZ 01IS18052 A-D), and the Human Frontier Science Program (RGY0076/2018).

References

[1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," CoRR, vol. abs/1611.03530, 2016.
[2] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro, "Towards understanding the role of over-parametrization in generalization of neural networks," arXiv preprint arXiv:1805.12076, 2018.
[3] A. K. Lampinen and S. Ganguli, "An analytic theory of generalization dynamics and transfer learning in deep linear networks," arXiv preprint arXiv:1809.10374, 2018.
[4] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. De Freitas, "Predicting parameters in deep learning," in Advances in Neural Information Processing Systems, pp. 2148–2156, 2013.
[5] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, pp. 598–605, 1990.
[6] H. Huang, "Mechanisms of dimensionality reduction and decorrelation in deep neural networks," Physical Review E, vol. 98, no. 6, p. 062313, 2018.
[7] L. Amsaleg, O. Chelly, T. Furon, S. Girard, M. E. Houle, K.-i. Kawarabayashi, and M. Nett, "Estimating local intrinsic dimensionality," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 29–38, ACM, 2015.
[8] L. Amsaleg, J. Bailey, D. Barbe, S. Erfani, M. E. Houle, V. Nguyen, and M. Radovanović, "The vulnerability of learning to adversarial perturbation increases with intrinsic dimensionality," in Information Forensics and Security (WIFS), 2017 IEEE Workshop on, pp. 1–6, IEEE, 2017.
[9] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. N. R. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song, and J. Bailey, "Characterizing adversarial subspaces using local intrinsic dimensionality," CoRR, vol. abs/1801.02613, 2018.
[10] T. Yu, H. Long, and J. E. Hopcroft, "Curvature-based comparison of two neural networks," arXiv preprint arXiv:1801.06801, 2018.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[12] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, "SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability," in Advances in Neural Information Processing Systems, pp. 6078–6087, 2017.
[13] A. Morcos, M. Raghu, and S. Bengio, "Insights on representational similarity in neural networks with canonical correlation," in Advances in Neural Information Processing Systems, pp. 5727–5736, 2018.
[14] D. G. Barrett, A. S. Morcos, and J. H. Macke, "Analyzing biological and artificial neural networks: challenges with opportunities for synergy?," Current Opinion in Neurobiology, vol. 55, pp. 55–64, 2019.
[15] S. Gong, V. N. Boddeti, and A. K. Jain, "On the intrinsic dimensionality of image representations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3987–3996, 2019.
[16] X. Ma, Y. Wang, M. E. Houle, S. Zhou, S. M. Erfani, S.-T. Xia, S. Wijewickrema, and J. Bailey, "Dimensionality-driven learning with noisy labels," arXiv preprint arXiv:1806.02612, 2018.
[17] E. Facco, M. d'Errico, A. Rodriguez, and A. Laio, "Estimating the intrinsic dimension of datasets by a minimal neighborhood information," Scientific Reports, vol. 7, no. 1, p. 12140, 2017.
[18] E. Levina and P. J. Bickel, "Maximum likelihood estimation of intrinsic dimension," in Advances in Neural Information Processing Systems, pp. 777–784, 2005.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[21] S. Vascon, Y. Parin, E. Annavini, M. D'Andola, D. Zoccolan, and M. Pelillo, "Characterization of visual object representations in rat primary visual cortex," in European Conference on Computer Vision, pp. 577–586, Springer, 2018.
[22] D. Granata and V. Carnevale, "Accurate estimation of the intrinsic dimension using graph distances: Unraveling the geometric complexity of datasets," Scientific Reports, vol. 6, p. 31377, 2016.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[24] A. Achille and S. Soatto, "Emergence of invariance and disentanglement in deep representations," The Journal of Machine Learning Research, vol. 19, no. 1, pp. 1947–1980, 2018.
[25] C. Stringer, M. Pachitariu, N. Steinmetz, M. Carandini, and K. D. Harris, "High-dimensional geometry of population responses in visual cortex," bioRxiv, p. 374090, 2018.
[26] S. Tafazoli, H. Safaai, G. De Franceschi, F. B. Rosselli, W. Vanzella, M. Riggi, F. Buffolo, S. Panzeri, and D. Zoccolan, "Emergence of transformation-tolerant representations of visual objects in rat lateral extrastriate cortex," eLife, vol. 6, p. e22794, 2017.
[27] J. J. DiCarlo, D. Zoccolan, and N. C. Rust, "How does the brain solve visual object recognition?," Neuron, vol. 73, no. 3, pp. 415–434, 2012.
[28] G. Matteucci, R. B. Marotti, M. Riggi, F. B. Rosselli, and D. Zoccolan, "Nonlinear processing of shape information in rat lateral extrastriate cortex," Journal of Neuroscience, pp. 1938–18, 2019.
[29] J. J. DiCarlo and D. D. Cox, "Untangling invariant object recognition," Trends in Cognitive Sciences, vol. 11, no. 8, pp. 333–341, 2007.
[30] B. A. Olshausen and D. J. Field, "Sparse coding of sensory inputs," Current Opinion in Neurobiology, vol. 14, no. 4, pp. 481–487, 2004.
[31] R. Basri and D. Jacobs, "Efficient representation of low-dimensional manifolds using deep networks," arXiv preprint arXiv:1602.04723, 2016.
[32] S. Chung, D. D. Lee, and H. Sompolinsky, "Classification and geometry of general perceptual manifolds," Physical Review X, vol. 8, no. 3, p. 031003, 2018.
[33] P. P. Brahma, D. Wu, and Y. She, "Why deep learning works: A manifold disentanglement perspective," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 10, pp. 1997–2008, 2016.
[34] E. P. Simoncelli and B. A. Olshausen, "Natural image statistics and neural representation," Annual Review of Neuroscience, vol. 24, no. 1, pp. 1193–1216, 2001.
[35] S. Chung, D. D. Lee, and H. Sompolinsky, "Linear readout of object manifolds," Physical Review E, vol. 93, no. 6, p. 060301, 2016.
[36] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint arXiv:1703.00810, 2017.
[37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in NIPS-W, 2017.
[38] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010.