NeurIPS 2020

Consistent feature selection for analytic deep neural networks

Review 1

Summary and Contributions: The paper presents theoretical and empirical results advocating for the adaptive group lasso as a feature selection technique for neural networks.

Strengths: The technique provides a theoretical basis for using the adaptive group lasso to select features in regression environments under moderate assumptions.
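
For concreteness, the two penalties under discussion can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the groups are the rows of the first-layer weight matrix (one row per input feature), and the names `W`, `W_init`, `lam`, `zeta`, `gamma` and `eps` are hypothetical.

```python
import numpy as np

def group_lasso_penalty(W, lam):
    """Group lasso (GL): lam times the sum of the L2 norms of the rows of W.

    Assumption: W has shape (n_features, n_hidden) and row k holds all
    first-layer weights attached to input feature k.
    """
    return lam * np.linalg.norm(W, axis=1).sum()

def adaptive_group_lasso_penalty(W, W_init, zeta, gamma=2.0, eps=1e-12):
    """Adaptive group lasso (AGL): each group norm is reweighted by
    1 / ||W_init[k]||^gamma, where W_init is a first-stage estimate
    (for example the GL solution). eps only guards against division by zero.
    """
    init_norms = np.linalg.norm(W_init, axis=1)
    weights = 1.0 / np.maximum(init_norms, eps) ** gamma
    return zeta * (weights * np.linalg.norm(W, axis=1)).sum()

# Toy usage: feature 1 has a small first-stage norm, so AGL penalises it heavily.
W_init = np.array([[3.0, 4.0], [0.06, 0.08], [0.6, 0.8]])
gl_value = group_lasso_penalty(W_init, lam=0.1)
agl_value = adaptive_group_lasso_penalty(W_init, W_init, zeta=1.0)
```

The reweighting is what distinguishes the two stages: groups that the first stage shrank towards zero receive a much larger penalty in the second stage, while groups with large first-stage norms are penalised only lightly.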

Weaknesses: The assumption that the data is positive and completely continuous on its domain is quite strong, and likely to be broken for most practical datasets. Can the authors comment on the likely behaviour of GL+AGL when this assumption is broken? (Presumably it's broken in Boston Housing, but it's difficult to see any effect given the paucity of results.) There is no discussion of the predictive performance of a network trained using this regularisation penalty. The results only show the feature set recovery, but it's conceivable that the net might recover the feature set yet predict the output poorly. Without empirical results it's difficult to say. All the neural nets are trained for large numbers of epochs, with no indication of the performance under early stopping or other techniques to reduce the computational load. Thus it's hard to know whether the feature set becomes fixed early in training, or whether it is selected only later, after the model has fit the training data. Without this information (or at least some guidance) it's hard to apply the technique to new problems where it may not be feasible to train for so many epochs. Similarly, the datasets have at most 50 features. Many real-world problems have many more features (especially where feature selection is used), and it's unclear how the technique would scale to those larger problems.

Correctness: I did not verify the proofs in detail. The empirical methodology seems fine, though the experiments have insufficient breadth.

Clarity: Given that the technique is gradient based, I'd have expected some discussion of when features are disabled and whether they can be re-enabled (though no gradient would flow to them when they output zero). This can occur in the lasso, and it seems precluded here by the use of SGD.
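
To make the question of disabling and re-enabling concrete, here is a minimal sketch of one standard way to obtain exact zeros with a group penalty: a gradient step on the loss followed by group soft-thresholding (a proximal step). Whether the authors' training procedure does this is not stated in the review, so treat it as an assumption; the function names, `lr` and `thresholds` are hypothetical.

```python
import numpy as np

def group_soft_threshold(W, thresholds):
    """Proximal operator of sum_k t_k * ||W[k]||_2.

    Row k of W is the weight group of input feature k. A row whose norm is
    at most t_k is set exactly to zero; other rows are shrunk towards zero.
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    t = np.asarray(thresholds, dtype=float).reshape(-1, 1)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return W * scale

def proximal_step(W, grad, lr, thresholds):
    """One proximal-gradient update: gradient step on the (unpenalised) loss,
    then group soft-thresholding with the step-size-scaled thresholds."""
    return group_soft_threshold(W - lr * grad, lr * np.asarray(thresholds, dtype=float))

# A group that is currently zero, with a large and a small loss gradient.
W_zero = np.zeros((1, 2))
reenabled = proximal_step(W_zero, np.array([[-30.0, -40.0]]), lr=0.1, thresholds=1.0)
still_off = proximal_step(W_zero, np.array([[-0.3, -0.4]]), lr=0.1, thresholds=1.0)
```

In this sketch a zeroed group stays at zero only while the norm of the loss gradient on that group is below its threshold, so a disabled feature can in principle come back. Reporting how often this happens during training would address the point raised above.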

Relation to Prior Work: Lots of relevant literature is cited, but it's not related in the text to the work at hand, nor is there much discussion of the relationship between this work and other works, beyond the GL+AGL 2020 paper.

Reproducibility: Yes

Additional Feedback: The author response was limited to 1 page, while a 5 page response was submitted, and so to be fair to other submissions we've decided to ignore the response when considering the decision.

Review 2

Summary and Contributions: The paper aims to establish variable selection consistency when an analytic deep neural network model is imposed. In particular, the authors focus on the case when the dimension p is fixed and sample size n goes to infinity. An important assumption made in the paper is the strict positivity of the density over the domain. The authors show that the Adaptive Group Lasso selection procedure is selection-consistent. The general framework of the proof follows that for the analysis of high-dimensional linear models. But the authors also introduce several new technical elements.
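
For reference, the textbook form of selection consistency can be written as below. This is a sketch in assumed notation, and the paper's own Definitions 2.2 and 2.3 may be phrased differently: $\hat{\alpha}_n^{(k)}$ denotes the group of estimated first-layer weights attached to feature $k$, $\hat{S}_n$ the selected set, and $S^*$ the set of truly relevant features.

```latex
\hat{S}_n = \bigl\{\, k \in \{1,\dots,p\} : \|\hat{\alpha}_n^{(k)}\| \neq 0 \,\bigr\},
\qquad
\lim_{n \to \infty} \mathbb{P}\bigl(\hat{S}_n = S^*\bigr) = 1 .
```

With $p$ fixed, this requires that every relevant feature is kept and every irrelevant one is dropped, with probability tending to one as $n$ grows.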

Strengths: The authors provide a rigorous characterization of feature selection consistency. Under this definition, with fixed p and strictly positive densities, the model identifiability issue can be circumvented. They successfully show variable selection consistency for a rich class of deep neural network models. The results make deep neural network models more interpretable.

Weaknesses: My concerns lie mainly in the following two areas: 1. Is the model specified in Assumption 2.1 a good one for the data-generating process? And are Definitions 2.2 and 2.3 appropriate for characterizing feature selection consistency? The authors may argue that these definitions are natural extensions of the concepts for linear models to the deep neural network case. But my concern is that the linear model is well known to be quite robust. Even if the true data-generating process is not exactly characterized by a linear model, the variable selection results based on a linear model can still help us draw useful conclusions. However, for deep neural networks, the nonlinearity and complicated design may make the model less robust. If the underlying distribution differs from the model in Assumption 2.1, which is probably the case in most applications, then to what extent can we trust the variable selection result? 2. To apply the result of Theorem 3.6, given a data set, how should we choose the tuning parameters? Do \gamma and \nu depend on the true model?

Correctness: I did not check the details of the proof. But it seems that the outline of the proof is correct.

Clarity: The paper is well written, and the proofs are also easy to follow. A typo I notice is in line 148, "be setting" should be "by setting".

Relation to Prior Work: The authors provide a detailed review of the relevant literature.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper studies feature selection consistency of the (adaptive) group lasso for a wide class of deep neural networks. The paper also shows the advantage of the adaptive group lasso over the naive group lasso by showing that the adaptive group lasso achieves selection consistency under fewer regularity conditions on the inputs.

Strengths: The main advantage of the paper is that its analysis is applicable to a wide class of neural networks. The use of Łojasiewicz's inequality to remove the regularity condition on the Hessian matrix seems new and interesting.
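
For readers unfamiliar with the tool, the classical Łojasiewicz distance inequality reads as follows. The notation is assumed for illustration ($f$, $K$, $Z$, $C$ and $\nu$ are not taken from the paper), and how exactly the authors apply it is not spelled out in this review.

```latex
% f : U \to \mathbb{R} real analytic, K \subset U compact, Z = \{ x \in U : f(x) = 0 \} \neq \emptyset.
% Then there exist constants C > 0 and \nu > 0 such that
\operatorname{dist}(x, Z)^{\nu} \;\le\; C \, |f(x)| \qquad \text{for all } x \in K .
```

Applied to a nonnegative analytic function such as an excess risk, an inequality of this type bounds the distance from a parameter to the set of minimisers by a power of the function value, without requiring a nondegenerate Hessian at the minimisers.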

Weaknesses: This paper does not consider feature selection in the high-dimensional setting (i.e., where the number of features grows with the number of data points), and it is unclear whether the analysis in this paper can be generalized to this case. The paper claims that previous work on feature selection for neural networks mainly focuses on prediction rather than selection consistency. However, many previous works already give selection consistency analyses, or their prediction results actually imply feature selection consistency. The key advantage claimed, that the adaptive group lasso outperforms the lasso, seems neither new nor surprising. ****** Updated ****** Thanks to the authors for the rebuttal. However, I still feel that the step from excess risk to feature selection consistency is not that complicated, and I believe the authors should discuss the technical advantages of the proposed framework in more detail.

Correctness: I believe the paper is correct.

Clarity: In general the paper is well written and easy to read.

Relation to Prior Work: The authors claim that most previous papers aim at excess risk bounds while this paper focuses on selection consistency. However, various previous works actually give selection consistency results, and I believe the authors should compare their theoretical analysis with those works.

Reproducibility: Yes

Additional Feedback: Below please find my detailed comments:
1. This paper does not consider feature selection in the high-dimensional setting. It considers an easier problem setting in which the total number of features is fixed and does not scale with the number of data points. It is important to study a feature selection method in the high-dimensional setting, where the total number of features may grow exponentially fast w.r.t. the number of data points.
2. This paper focuses on feature selection consistency and claims that most previous work mainly focuses on excess error bounds because it is hard to address the nonidentifiability issue in neural networks. However, various previous works already give selection consistency results, for example [1, 2, 3]. And selection consistency in [4, 5] can be obtained easily with several simple arguments built upon their analyses. I believe it is necessary for the authors to do a more extensive literature review and give a more detailed discussion comparing their method with the existing works.
[1] Bayesian neural networks for selection of drug sensitive genes
[2] Variable Selection via Penalized Neural Network: a Drop-Out-One Loss Approach
[3] Posterior Concentration for Sparse Deep Learning
[4] Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification
[5] Variable Selection with Rigorous Uncertainty Quantification using Deep Bayesian Neural Networks: Posterior Concentration and Bernstein-von Mises Phenomenon
3. I was also wondering about the technical contribution of the proposed analysis framework. How does it make it easier to analyze deep networks compared with previous methods? I think it would be good if the authors could discuss the advantage of the proposed framework: what can it do that previous work cannot?

Review 4

Summary and Contributions: This paper presents theoretical results on feature selection consistency for deep neural networks with adaptive group lasso penalty. The authors do so by circumventing the need to account for unidentifiability in deep neural networks, which has been the source of major difficulty in analyzing these methods.

Strengths: This paper generalizes Dinh and Ho (2020) by (i) establishing a way to study the behavior of Group Lasso without a full characterization of the optimal parameter set and (ii) using Łojasiewicz's inequality to upper-bound the distance between an estimated and an optimal parameter. By doing so, this paper shows that the analyses of feature selection consistency in Dinh and Ho (2020) apply not only to an irreducible feed-forward network with a single hidden layer but also to a broader range of deep analytic neural networks such as feed-forward networks with multiple hidden layers and convolutional neural networks. Also, the empirical illustration of their analyses seems interesting, especially the performance gap between feature selection with GL + AGL and with GL only. In experiments, the authors use deep neural networks to show that their analyses apply to neural networks with multiple hidden layers.

Weaknesses: The experimental results are insufficient. The simulation was done on a single architecture with only 50 input features, and the single real dataset had only 13 input features. Both the simulated and real data had a small number of input features, which seems inadequate for a study of a feature selection problem. Considering that the claim of the paper applies to a broad class of neural networks, the experiments should include deep neural networks from different classes.

Correctness: Yes.

Clarity: In general, this paper is easy to follow. Minor comments: In Definition 2.3 on page 3, I assume n is the number of samples. It would be good to define this. On line 159, I assume d is the distance between two parameter sets. It would be good to define this.

Relation to Prior Work: Yes. In the Introduction, the authors mention Dinh and Ho (2020), on which this paper seems to be based, and clearly state the difference between this paper and Dinh and Ho (2020).

Reproducibility: Yes

Additional Feedback: ***After author feedback*** Since the authors did not follow the formatting guideline of 1 page for the author feedback, the reviewers were advised not to take into account the author response.