NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3292
Title:Intrinsic dimension of data representations in deep neural networks

Reviewer 1

The paper examines the intrinsic dimension of representations learned by neural networks. This is a natural and worthwhile direction for understanding the behaviour and complexity of DNNs from a geometric angle. The paper is reasonably clearly written overall.

Treatment of related work is brief and not well connected to the proposed study (the main claim of novelty is that the study is more "systematic" than previous works, but it's not clear what that actually means). Some important related work is missed.

The technical contribution is low from a methods perspective, since the study is almost entirely empirical and doesn't develop new techniques (the intrinsic dimension estimator that is used has been published previously). Of the various findings/studies performed, the most original is the investigation of the curvature of the representations (comparison to linear dimensionality). However, this alone isn't enough to lift the paper into the "high significance" category, since most of the conclusions are not surprising in light of existing work. Other works have also observed that the intrinsic dimension is much lower than the input dimension (see [C] below). [C] has already observed that the intrinsic dimension decreases during training (and that accuracy increases during training as intrinsic dimension decreases). [C] also studied the case of noisy data (permuted class labels). Hence the findings of 3.2 and 3.5 have limited novelty in light of [C], which isn't referenced.

The estimator: I recommend the authors emphasise that they are using a global estimator of intrinsic dimensionality rather than a local one, i.e. they are assessing the dimension of a collection of points rather than the (local) intrinsic dimension around a point (which is covered in [A] and [B] below; these works should be cited for completeness).

Reference [7] in the paper should probably instead have been the ICLR'18 paper by X. Ma et al.
References
[A] Estimating Local Intrinsic Dimensionality. Amsaleg et al. KDD'15
[B] Maximum Likelihood Estimation of Intrinsic Dimension. Levina et al. NIPS'04
[C] Dimensionality-Driven Learning with Noisy Labels. Ma et al. ICML'18
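
To make the global versus local distinction concrete, here is a minimal sketch of a local estimator in the spirit of the maximum likelihood approach of [B]. It is an illustration under stated assumptions, not the estimator used in the paper under review: the function names `knn_distances` and `local_id_mle`, the brute-force distance computation, and the default k = 20 are all hypothetical choices, and the data are assumed to contain no duplicate points.

```python
import numpy as np

def knn_distances(X, k):
    """Sorted distances from each row of X to its k nearest neighbours (brute force)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                       # a point is not its own neighbour
    d2 = np.sort(d2, axis=1)[:, :k]
    return np.sqrt(np.maximum(d2, 0.0))                # clip tiny negatives from round-off

def local_id_mle(X, k=20):
    """Per-point estimate: inverse of the mean of log(T_k / T_j) for j = 1..k-1."""
    T = knn_distances(X, k)                            # shape (N, k), ascending
    log_ratios = np.log(T[:, -1:] / T[:, :-1])         # shape (N, k-1)
    return 1.0 / log_ratios.mean(axis=1)

if __name__ == "__main__":
    # Hypothetical toy data: a 2-dimensional plane linearly embedded in 10 dimensions.
    rng = np.random.default_rng(0)
    plane = rng.uniform(size=(1500, 2)) @ rng.normal(size=(2, 10))
    local = local_id_mle(plane)
    print("per-point estimates, first five:", local[:5])
    print("one global summary (median):", np.median(local))
```

The function returns one value per data point, which is the local quantity discussed in [A] and [B]; collapsing those values into a single number (here the median, an arbitrary choice) gives a global summary, which is closer to what the paper reports.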

Reviewer 2

The authors use a recently proposed intrinsic dimensionality (ID) estimator to track the ID of representations across deep network layers and architectures. Overall, the results, text, and figures were very clear. The paper makes empirical observations that could be used to constrain theoretical models of computation. This method for estimating ID seems generally useful in representation learning research.

Comments/questions:

Is there consistent variability across layers if you estimate ID per class? Fig. 3 shows the variability at each layer, but it would also be helpful to know whether the fluctuations are random or consistent across layers.

In the section on classification performance, the correlation is shown across network architectures. Would you also find this correlation across networks with the same architecture but retrained with different random seeds? Does this correlation show up in the training error or training/testing loss? Is this true within one network during the course of training? It would be helpful to have a better understanding of where the ID variations are coming from.

I would suggest avoiding red and green as category colors as they are not red-green colorblind friendly.
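
The per-class question above could be explored with something like the following sketch. It uses a simple maximum-likelihood form of the two-nearest-neighbour idea, in which the ratio mu = r2 / r1 of second to first neighbour distances is assumed to follow a Pareto law whose exponent is the ID. This may differ in detail from the fitting procedure the authors use. The names `twonn_id` and `per_class_id` are hypothetical, and the activations are assumed to be available as a NumPy array with one row per sample.

```python
import numpy as np

def twonn_id(X):
    """Global ID estimate from second-to-first neighbour distance ratios (MLE form)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                       # exclude self-distances
    two = np.partition(d2, 1, axis=1)[:, :2]           # two smallest per row, unordered
    two = np.sqrt(np.maximum(two, 0.0))                # clip round-off negatives
    r1, r2 = two.min(axis=1), two.max(axis=1)
    keep = r1 > 0                                      # drop exact duplicates
    mu = r2[keep] / r1[keep]
    return keep.sum() / np.sum(np.log(mu))             # Pareto exponent MLE

def per_class_id(X, labels):
    """Apply the estimator separately to the samples of each class."""
    labels = np.asarray(labels)
    return {c: twonn_id(X[labels == c]) for c in np.unique(labels).tolist()}
```

Calling `per_class_id` on the activations of each layer in turn would give one curve per class, from which one could check whether a class with a high ID at one layer also has a high ID at the next.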

Reviewer 3

GENERAL

Originality: This is not a terribly original paper, but it addresses an important problem (ID with a tool that has not been applied before).

Quality: High-quality experimental work, with many experiments and follow-up experiments.

Clarity: Writing is clear. Research questions are clearly stated. As is typical for conference papers, there is a bit of over-selling.

Significance: The paper is a significant contribution. The authors report a quite universal shape of the ID vs. DNN depth curve. Explaining this universal hunchback seems like an important challenge for future analytic studies.

Weakness: The main weakness is that we have yet to understand what generates the (low) ID; see the discussion below. The (lower bound) comments on the TwoNN method (starting at line 98) are significant and could have been explored further.

How well-defined are IDs anyway? In line 100 you give the following hint about the physicality of IDs: "For real-world data, the intrinsic dimension always depends on the scale of distances on which the analysis is performed..." I agree. Presumably, for real data/objects there is a very large (infinite) set of continuous symmetry operations that all, when applied to a given object, lead to an equivalent object/same-class object. Hence the "real ID" is infinite; however, depending on noise and sample size/scale, there can be an effective ID, namely the main invariances of the object.

The authors speculate on the role of IDs in line 151: "These results suggest that state-of-the-art deep neural networks – after an initial increase in ID – perform a progressive dimensionality reduction of the input feature vectors." This interpretation is consistent with theoretical analyses such as Achille, A. and Soatto, S., 2018. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1), pp.1947-1980. There, the DNN is shown to remove nuisances (symmetries/invariances) along the path from input to output.

The authors compute per-object/class IDs (e.g. line 137). What are the reasons they should be the same across objects?

SPECIFIC

Line 84: "This allows overcoming the problems related to..." These comments on the properties of the TwoNN method are important but could have been addressed more thoroughly. In the TwoNN paper there is no analysis of spaces with dimensions beyond ED = 5000 (Isomap face images/MNIST; not such a convincing analysis, by the way). Most of the checks are made at ED < 100. Here we go to 10-100x higher dimensions, cf. figure 2.

Eq. (1) assumes independence, which needs an argument at least.

But how well defined is the notion of neighbors in high-dimensional DNN spaces? Cf. Radovanović, M., Nanopoulos, A. and Ivanović, M., 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11(Sep), pp.2487-2531. Please inspect the 2-neighbor matrix to check whether there are "hubs". In a later analysis (line 188) you hint at this problem by using a *normalized* covariance matrix, equivalent to using cosine distance as recommended in the hubs paper.
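
The hub check requested above could look roughly like the following sketch. It counts how often each point appears among the k nearest neighbours of the other points (the k-occurrence) and summarises the distribution of those counts by its skewness, which is the hubness indicator used in the hubs paper. The function names `k_occurrence` and `skewness` and the `metric` switch are hypothetical, and rows are assumed to be nonzero when the cosine option is used.

```python
import numpy as np

def k_occurrence(X, k=2, metric="euclidean"):
    """Count how many times each point is among the k nearest neighbours of others."""
    X = np.asarray(X, dtype=float)
    if metric == "cosine":
        # On unit vectors, Euclidean and cosine distance give the same neighbour ranking.
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                       # exclude self-matches
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]     # indices of the k nearest per row
    return np.bincount(nn.ravel(), minlength=X.shape[0])

def skewness(counts):
    """Third standardised moment; undefined (nan) if all counts are equal."""
    c = counts - counts.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5
```

A strongly positive skewness, with a few points collecting a large share of the neighbour slots, would indicate hubs. Comparing `metric="euclidean"` with `metric="cosine"` on the same layer activations would show whether normalisation reduces the effect.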