__ Summary and Contributions__: The paper presents theoretical and empirical results advocating for the adaptive
group lasso as a feature selection technique for neural networks.
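(For concreteness, a minimal sketch of the penalties under discussion; this is my own illustration, not code from the paper. It assumes the standard formulation in which the outgoing first-layer weights of each input feature form one group, with the adaptive weights computed from a pilot group lasso fit:)

```python
import numpy as np

def group_lasso_penalty(W1, lam):
    # W1: first-layer weight matrix of shape (hidden, p); column j holds
    # the group of weights leaving input feature j. The penalty sums the
    # Euclidean norms of the columns, encouraging whole features to zero.
    return lam * np.linalg.norm(W1, axis=0).sum()

def adaptive_group_lasso_penalty(W1, W1_pilot, lam, gamma=1.0):
    # Adaptive variant: each group's norm is reweighted by the inverse
    # pilot group norm raised to gamma, so groups that the pilot GL fit
    # already shrank toward zero are penalised more heavily.
    eps = 1e-12  # guard against division by zero for fully-zeroed groups
    pilot_norms = np.linalg.norm(W1_pilot, axis=0)
    weights = 1.0 / (pilot_norms + eps) ** gamma
    return lam * (weights * np.linalg.norm(W1, axis=0)).sum()
```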

__ Strengths__: The paper provides a theoretical basis for using the adaptive group lasso to
select features in regression settings under moderate assumptions.

__ Weaknesses__: The assumption that the data is positive and completely continuous on its
domain is quite strong, and likely to be violated for most practical datasets. Can
the authors comment on the likely behaviour of GL+AGL when this assumption is
broken? (Presumably it's broken in Boston Housing, but it's difficult to see any
effect given the paucity of results).
There is no discussion of the predictive performance of a network trained using
this regularisation penalty. The results only show feature set recovery, but
it's conceivable that the net might recover the feature set yet predict the
output poorly. Without empirical results it's difficult to say.

All the neural nets are trained for large numbers of epochs, with no indication
of the performance under early stopping or other techniques to reduce the
computational load. Thus it's hard to know whether the feature set becomes fixed
early in training, or whether it is only selected later, after the model has fit
the training data. Without this information (or at least some guidance) it's
hard to apply the technique to new problems where it may not be feasible to
train for so many epochs.

Similarly, the datasets have at most 50 features. Many real-world problems have
many more features (especially where feature selection is used), and it's
unclear how the technique would scale to those larger problems.

__ Correctness__: I did not verify the proofs in detail. The empirical methodology seems fine, though
the experiments have insufficient breadth.

__ Clarity__: Given that the technique is gradient-based, I'd have expected some discussion
of when features are disabled and whether they can be re-enabled (though no
gradient would flow to them once they output zero). Re-activation can occur in
the lasso, but it seems precluded here by the use of SGD.
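(To illustrate the re-activation mechanism I have in mind: under a proximal-gradient update with block soft-thresholding, which is my own sketch and not the authors' SGD training procedure, a zeroed group re-enters whenever its data-loss gradient grows large enough:)

```python
import numpy as np

def block_soft_threshold(g, step, lam):
    # Proximal operator of the group lasso penalty applied to one group g:
    # shrink the group's norm by step*lam, zeroing the group entirely if
    # the norm falls below the threshold.
    norm = np.linalg.norm(g)
    if norm <= step * lam:
        return np.zeros_like(g)
    return (1.0 - step * lam / norm) * g

# A zeroed group stays zero only while its data-loss gradient is small:
group = np.zeros(3)
small_grad = np.array([0.05, 0.0, 0.0])
large_grad = np.array([2.0, 0.0, 0.0])
step, lam = 0.1, 1.0

stays_zero = block_soft_threshold(group - step * small_grad, step, lam)
revived = block_soft_threshold(group - step * large_grad, step, lam)
```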

__ Relation to Prior Work__: Much relevant literature is cited, but it is not connected in the text to the
work at hand, nor is there much discussion of the relationship between this work
and others, beyond the GL+AGL 2020 paper.

__ Reproducibility__: Yes

__ Additional Feedback__: The author response was limited to 1 page, while a 5-page response was submitted, so to be fair to other submissions we decided to ignore the response when considering the decision.

__ Summary and Contributions__: The paper aims to establish variable selection consistency when an analytic deep neural network model is imposed. In particular, the authors focus on the case when the dimension p is fixed and sample size n goes to infinity. An important assumption made in the paper is the strict positivity of the density over the domain. The authors show that the Adaptive Group Lasso selection procedure is selection-consistent. The general framework of the proof follows that for the analysis of high-dimensional linear models. But the authors also introduce several new technical elements.

__ Strengths__: The authors provide a rigorous characterization of feature selection consistency. Under this definition, with fixed p and strict positive densities, the model identifiability issue can be circumvented. They successfully show variable selection consistency for a rich class of deep neural network models. The results make the deep neural networks model more interpretable.

__ Weaknesses__: My concerns mainly lie in the following two perspectives:
1. Is the model specified in Assumption 2.1 a good one for the data-generating process? And are Definitions 2.2 and 2.3 appropriate for characterizing feature selection consistency? The authors may argue that these definitions are natural extensions of the concepts for linear models to the deep neural network case. But my concern is that the linear model is well known to be quite robust: even if the true data-generating process is not exactly characterized by a linear model, variable selection results based on a linear model can still help us draw useful conclusions. For deep neural networks, however, the nonlinearity and complicated design may make the model less robust. If the underlying distribution differs from the model in Assumption 2.1, which is probably the case in most applications, then to what extent can we trust the variable selection result?
2. To apply the result of Theorem 3.6, given a data set, how should we choose the tuning parameters? Do \gamma and \nu depend on the true model?

__ Correctness__: I did not check the details of the proof. But it seems that the outline of the proof is correct.

__ Clarity__: The paper is well written, and the proofs are also easy to follow. A typo I notice is in line 148, "be setting" should be "by setting".

__ Relation to Prior Work__: The authors provide a detailed review of the relevant literature.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This paper studies feature selection consistency for a wide class of deep neural networks under the (adaptive) group lasso. It also shows the advantage of the adaptive group lasso over the naive group lasso by demonstrating that the adaptive variant gives selection consistency under weaker regularity conditions on the inputs.

__ Strengths__: The main advantage of the paper is that its analysis is applicable to a wide class of neural networks.
The use of Łojasiewicz's inequality to remove the regularity condition on the Hessian matrix seems new and interesting.

__ Weaknesses__: This paper does not consider feature selection in the high-dimensional setting (i.e., where the number of features grows with the number of data points), and it is unclear whether the analysis can be generalized to that case.
The paper claims that previous work on feature selection for neural networks mainly focuses on the prediction aspect rather than selection consistency. However, many previous works already give selection consistency analyses, or their prediction results actually imply feature selection consistency.
The key claimed advantage, that the adaptive group lasso outperforms the group lasso, seems neither new nor surprising.
****** Updated ******
Thanks to the authors for the rebuttal. However, I still feel the step from excess risk to feature selection consistency is not that complicated, and I believe the authors should discuss the technical advantage of the proposed framework in more detail.

__ Correctness__: I believe the correctness of the paper.

__ Clarity__: In general the paper is well written and easy to read.

__ Relation to Prior Work__: The authors claim that most previous papers aim at excess risk bounds while this paper focuses on selection consistency. However, various previous works actually give selection consistency, and I believe the authors should compare their theoretical analysis with those works.

__ Reproducibility__: Yes

__ Additional Feedback__: Below please find my detailed comments:
1. This paper does not consider feature selection in the high-dimensional setting: it considers an easier problem in which the total number of features is fixed and does not scale with the number of data points. It is important to study a feature selection method in the high-dimensional setting, where the total number of features may grow exponentially fast with the number of data points.
2. This paper focuses on feature selection consistency and claims that most previous work focuses on excess risk bounds because it is hard to address the non-identifiability issue in neural networks. However, various previous works already give selection consistency results, for example [1, 2, 3], and the selection consistency in [4, 5] can be easily obtained with a few simple arguments built upon their analyses.
I believe the authors need to do a more extensive literature review and give a more detailed discussion comparing their method with existing works.
[1] Bayesian neural networks for selection of drug sensitive genes
[2] Variable Selection via Penalized Neural Network: a Drop-Out-One Loss Approach
[3] Posterior Concentration for Sparse Deep Learning
[4] Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification
[5] Variable Selection with Rigorous Uncertainty Quantification using Deep Bayesian Neural Networks: Posterior Concentration and Bernstein-von Mises Phenomenon
3. I was also wondering about the technical contribution of the proposed analysis framework. How does it make analyzing deep networks easier than previous methods do? It would be good if the authors could discuss the advantage of the proposed framework: what can it do that previous work cannot?

__ Summary and Contributions__: This paper presents theoretical results on feature selection consistency for deep neural networks with adaptive group lasso penalty. The authors do so by circumventing the need to account for unidentifiability in deep neural networks, which has been the source of major difficulty in analyzing these methods.

__ Strengths__: This paper generalizes Dinh and Ho (2020) by (i) establishing a way to study the behavior of the group lasso without a full characterization of the optimal parameter set and (ii) using Łojasiewicz's inequality to upper-bound the distance between an estimated parameter and an optimal one. By doing so, this paper shows that the feature selection consistency analyses of Dinh and Ho (2020) apply not only to an irreducible feed-forward network with a single hidden layer but also to a broader range of deep analytic neural networks, such as feed-forward networks with multiple hidden layers and convolutional neural networks.
The empirical illustration of the analyses is also interesting, especially the performance gap between feature selection with GL+AGL and with GL only. In the experiments, the authors use deep neural networks to show that their analyses apply to networks with multiple hidden layers.

__ Weaknesses__: The experimental results are insufficient. The simulation was done on a single architecture with only 50 input features, and the single real dataset had only 13 input features. Both the simulated and real data had a small number of input features, which seems inadequate for a study of the feature selection problem. Considering that the claim of the paper applies to a broad class of neural networks, the experiments should include deep neural networks from different classes.

__ Correctness__: Yes.

__ Clarity__: In general, this paper is easy to follow.
Minor comments:
In Definition 2.3 on page 3, I assume n is the number of samples. It’d be good to define this.
On line 159, I assume d is the distance between two parameter sets. It’d be good to define this.

__ Relation to Prior Work__: Yes. In the Introduction, the authors mention Dinh and Ho (2020), on which this paper seems to be based, and clearly state the differences between this paper and Dinh and Ho (2020).

__ Reproducibility__: Yes

__ Additional Feedback__: ***After author feedback***
Since the authors did not follow the formatting guideline of 1 page for the author feedback, the reviewers were advised not to take into account the author response.