__ Summary and Contributions__: I have read the authors' rebuttal and the other reviews.
I maintain my score.
______
This article proposes a variable selection procedure that takes into account both the labels and the interactions with other features. To do so, the authors formally define notions of "redundancy" by means of both the mutual information (MI) within a set of features (which measures how representative a given variable is) and the MI with respect to the labels (which measures prediction impact). Groups of features are then selected by maximizing a (balanced) sum of these two quantities.
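Schematically (this notation is mine, not the paper's Eq. 3), the criterion just described has the form

```latex
\max_{\text{grouping}} \; I(\hat{X}; X) \;+\; \lambda \, I(\bar{X}; Y),
```

where the first term captures how representative the characteristic features $\hat{X}$ are of the full feature set $X$, the second term captures the predictive value of the grouped features $\bar{X}$ for the labels $Y$, and $\lambda$ balances the two.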

__ Strengths__: The methods are rigorously presented, with clear and formal definitions. The experimental evaluation shows that the proposed Group Interpreter algorithm is competitive, although it is not uniformly better than classical feature selection algorithms such as the Lasso (see Table 2).
The overall contribution is solid and transparent.

__ Weaknesses__: The overall architecture is quite complex and leads to a non-convex optimization problem for approximating the exact formulation in Eq. 3. This introduces computational cost, but also a lack of guarantees on the solutions obtained in practice (local minima, etc.).
The paper does not present any statistical guarantee.

__ Correctness__: I did not check all the details of the proofs.

__ Clarity__: The paper is clearly written. The architecture presented in Section 2 is quite complex, however, and its description lacks intuition. For instance, the choice of the autoencoder and the surrounding discussion are quite hard to follow.

__ Relation to Prior Work__: Fairly discussed with various comparisons.

__ Reproducibility__: Yes

__ Additional Feedback__:
It would be interesting to report the computational overhead of the methods being compared, since in some settings they achieve fairly close performance.

__ Summary and Contributions__: This paper describes a framework to discover groups of related features and identify relevant groups for predicting labels on a per-sample basis. The goal is to have an interpretable latent factor model that can be used to better understand datasets with naturally modular structure, like gene expression. The technical approach is based on a customized autoencoder style architecture optimized with an information-theoretic objective. Experiments on real and synthetic data and comparisons with several comparable methods validate the approach.

__ Strengths__: The theoretical grounding was clear and easy to follow. Empirically evaluating unsupervised learning when the goal is to find interpretable structure is difficult, but I thought the paper did a good job of including a mix of synthetic and real-world experiments, and reasonable baselines for comparison. Discovering interpretable information in an unsupervised way, especially in biomedical data like gene expression, is an important problem and this seems like a good contribution in that direction. The approach is general enough to be of wide interest to the NeurIPS community.

__ Weaknesses__: - Validating the approach on gene expression
I really wanted to see that the per-instance feature grouping provides something novel for gene expression. That is, I would like to see that for smokers, genes A, B, C, D are all correlated (in a group for those instances), but for non-smokers, A, B, C, D are uncorrelated (not grouped for those instances). I don't think that's what's shown in Figs. 5 and 6, where we see that the expression levels for A, B, C, D (top rows) are low for smokers and high for non-smokers. Plots like Figs. 5 and 6 could probably be made with any of the global feature clustering methods. Also, there was no attempt to interpret the resulting clusters, for example by comparing enrichment of terms from the Gene Ontology database. (Though this is tricky for NeurIPS, where readers won't be familiar with the domain, it should still be possible to define some sensible validation of the groups beyond the visual plots, which are not very strong justification.)
- A limitation of the approach
The construction of the groups relies on essentially unweighted averaging of the features in a group, so the features should ideally be positively correlated, with similar correlation strengths. In gene expression, however, you often see a mix of correlated and anti-correlated genes that should arguably be in the same group.
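On the Gene Ontology suggestion above: a minimal validation of the discovered groups would be the classic over-representation test, a one-sided hypergeometric tail probability computed per (group, GO term) pair. A sketch, with the function name and interface being mine:

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """One-sided hypergeometric test: probability of drawing >= k
    annotated genes when sampling n genes (one discovered group)
    from a universe of N genes of which K carry the GO annotation."""
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    ) / comb(N, n)
```

A small p-value for some term would suggest the group is biologically coherent, which is a stronger justification than the visual plots alone.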

__ Correctness__: Yes, the theoretical development and empirical approach were appropriate and correct.

__ Clarity__: The paper was well presented and easy to read. The authors' version of the new "broader impacts" section was the most eloquent I've read so far.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: A few other minor comments:
Other methods to compare?
For gene expression experiments, I've found k-means clustering using R^2 on columns to be a surprisingly good baseline for global feature clustering (and it allows for getting groups that are anti-correlated). I've also seen methods for finding groups of variables with high multivariate mutual information / total correlation / redundancy. Those ideas could be interesting for comparison or for improving the representation redundancy part to allow for groups that are not always positively correlated.
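A minimal version of that baseline, as I read the suggestion (represent each feature by its row of squared correlations and run k-means on those rows; all names and details here are illustrative):

```python
import numpy as np

def corr_feature_clusters(X, k, n_iter=50, seed=0):
    """Cluster the columns (features) of X into k groups by running
    k-means on each feature's vector of squared correlations (R^2)
    with every other feature.  Squaring makes strongly anti-correlated
    features look similar, so they can land in the same group."""
    rng = np.random.default_rng(seed)
    r2 = np.corrcoef(X, rowvar=False) ** 2           # (d, d) R^2 matrix
    # farthest-point initialisation for stable centroids
    idx = [int(rng.integers(r2.shape[0]))]
    for _ in range(k - 1):
        dist = ((r2[:, None, :] - r2[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(dist.argmax()))
    centroids = r2[idx].copy()
    for _ in range(n_iter):                          # Lloyd iterations
        labels = ((r2[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = r2[labels == j].mean(0)
    return labels
```

On data with two latent factors, one of which drives an anti-correlated pair of features, the pair still ends up in a single cluster, which is the behaviour the positively-correlated averaging in the paper cannot capture.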
Attention-based learning methods are natural comparisons for things like Fig. 7. Those methods really only capture the "relevant redundancy", but it might be nice to show explicit examples where they fail to distinguish two distinct groups of relevant features (or maybe they would tend to ignore some correlated features that are grouped together in your approach).
- Isn't there a more direct ref for the Gumbel Softmax trick?
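For readers unfamiliar with it, the trick (usually credited to Jang et al. and Maddison et al., 2017) perturbs logits with Gumbel noise and relaxes the argmax with a temperature-controlled softmax; a generic sketch, not the paper's implementation:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable approximate sample from Categorical(softmax(logits)).

    Adds i.i.d. Gumbel(0,1) noise to the logits, then applies a softmax
    at temperature tau: as tau -> 0 the output approaches a one-hot
    sample, while large tau smooths it towards uniform."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=np.shape(logits))
    g = -np.log(-np.log(u))                 # Gumbel(0,1) noise
    y = (np.asarray(logits) + g) / tau
    y = np.exp(y - y.max())                 # numerically stable softmax
    return y / y.sum()
```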
- Sample complexity
Can you say much about the sample complexity? The grouping mechanism is probably quite sample efficient, as it gives a strong prior.
- Other real-world applications
Using MNIST and FMNIST probably makes sense as a second set of real-world applications, since these are familiar to the NeurIPS audience, but I think other domains might have been more appropriate.
It would be fun to look at personality survey data. There, psychologists assume that questions/features are always globally correlated, but you might find that there are people who interpret the questions differently somehow.
Neuroimaging also has modular structure, though it might be too high-dimensional for your SVD step.
Text documents might also be interesting. Perhaps you could disambiguate words.
Finally, if you are looking for more large-scale empirical tests, OpenML has a lot of gene expression data.
EDIT: I have read the author feedback.
Even though you compare 9 methods, I think that evaluating *only* test accuracy is not sufficient. Since the method purports to find clusters of genes that are relevant for prediction, there should be some way to measure the extent to which those clusters reflect meaningful structure. Fig. 7 gives a qualitative view of this for MNIST and Table 3 measures it for the synthetic case, but I think the work would have more impact if it could demonstrate the value of the learned structure in a challenging setting like gene expression.
That's also what I was looking for in Fig. 5, which I realize I misinterpreted. Scrutinizing it now, I can see how you find relevant clusters of genes for each *individual*. Your sorting identifies the largest cluster relevant for smoking and non-smoking, respectively. But again, I don't see any way of knowing whether these groups are meaningful or useful. There are a lot of "feature importance" methods applied to neural networks that produce per-sample feature importances as you do, but often the identified features are not particularly meaningful (e.g., as discussed here: https://www.aaai.org/ojs/index.php/AAAI/article/view/4252).

__ Summary and Contributions__: This paper considers the problem of group feature extraction. The analysis is based on the group mutual information (gI). Using gI, the feature redundancy and relevant redundancy are defined. The so-called "Instance-wise Feature Grouping" formulation is then constructed to minimize the redundancy.
One main challenge of using mutual information is the difficulty of optimization, and the authors use a variational lower bound to approximate it.
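For context, the generic form of such a bound (Barber & Agakov; my notation, which may differ from the paper's) is

```latex
I(X;Y) \;=\; H(Y) - H(Y \mid X) \;\ge\; H(Y) + \mathbb{E}_{p(x,y)}\!\left[\log q(y \mid x)\right],
```

with equality when the variational decoder $q(y \mid x)$ matches the true posterior $p(y \mid x)$; since $H(Y)$ is model-independent, maximizing the bound reduces to maximizing the expected log-likelihood, i.e., a reconstruction / cross-entropy term.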
Another contribution is using the Gumbel-softmax layer for the index sampling.
Results on both synthetic datasets and real gene datasets are provided to show the performance of the proposed model and algorithm.

__ Strengths__: The main contribution to me is the problem formulation based on group mutual information, i.e., definitions 1 and 2 and Theorems 1 and 2.
+) Optimization using the variational lower bound [11]
+) Gumbel-softmax to enable sampling
+) Optimization of the number of groups
+) Real-data results on gene data

__ Weaknesses__: I like the framework based on mutual information. The formulation is solid but not actually novel.
-) Theorem 3 is weak and might not be useful
-) The results are somewhat confusing. More analysis needs to be given on why the Lasso sometimes performs better.
-) Instead of the plots in Figures 3, 5 and 6, some analysis of the groups of genes might be helpful; the information content of these plots is low.
-) Concern about the proofs of Theorems 1 & 2: I am confused starting from Eq. 36 in the appendix. As claimed in your proof, the mutual information I(Xhat; X) is upper bounded by the entropy of X, i.e., I(X; X) = H(X). The derivation starting from Eq. 36 is valid only if this upper bound can actually be reached. My main concern is whether the upper bound is in fact reachable: since the characteristic features are a compressed version of the original features, and the reconstruction also involves non-invertible operations (e.g., ReLU), I am worried about the correctness of the proof and whether these two theorems actually contribute to the development of the method. More explanation would be appreciated.
-) I am a bit confused about the MI lower-bound estimation. The authors first say that samples can be generated using Eqs. 7 & 8, but the relationship with (P(X|Xhat), P(Y|Xbar)) is not clearly stated. The authors also say that these two distributions are unknown and therefore use another method. If so, why include Eqs. 7 & 8 and the corresponding explanation?
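To make the Eq. 36 concern concrete: the identity in question (generic notation) is

```latex
I(\hat{X}; X) \;=\; H(X) - H(X \mid \hat{X}) \;\le\; H(X),
```

with equality iff $H(X \mid \hat{X}) = 0$, i.e., iff $X$ is a deterministic function of $\hat{X}$; this generally fails when $\hat{X}$ is a lossy compression of $X$, which is exactly the concern above.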

__ Correctness__: Most of the claims and methods are correct and solid.
The empirical methodology is correct, but some improvement is needed in the gene data analysis.

__ Clarity__: Yes. The paper is easy to follow.

__ Relation to Prior Work__: The paper is based on mutual information; some additional literature review on MI-based methods is probably needed.

__ Reproducibility__: Yes

__ Additional Feedback__: 1) Literature review on mutual information for feature extraction
2) More analysis on gene data
3) State clearly the differences between this work and previous group feature extraction methods.
4) More details in the proofs
---
Post Rebuttal:
I have read the rebuttal, and it has addressed some of my concerns about the theorems and the mutual information. Therefore, I have changed my score.

__ Summary and Contributions__: The paper proposes an information-theoretic method for instance-wise feature selection, which essentially selects features that are relevant to (aligned with) the labels while not being redundant.

__ Strengths__: * The contribution appears novel and significant for the areas of feature selection and explainable AI.
* The proposed method is well motivated and well designed. The authors also propose bounds and optimise those to make model fitting easier and faster.
* The experiments combine both synthetic and real-world datasets. The synthetic datasets allow experiments that control the amount of redundancy, and the real gene-expression and MNIST datasets demonstrate the applicability of the proposed work.

__ Weaknesses__: * I may have missed it, but although dated, it would be good to mention and compare with "older" measures like mRMR. See https://www.computer.org/csdl/trans/tp/2005/08/i1226-abs.html .
* How do you stop each instance from obtaining its own group of features and overwhelming users in terms of cognitive attention?
* The model itself is relatively complex; it would be good to have some ablation-style testing to show that all components are necessary.
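On the mRMR point: the criterion greedily adds the feature maximising relevance to the label minus mean redundancy with the already-selected set. A self-contained sketch using a plug-in discrete MI estimate (all names are mine, not the cited paper's code):

```python
import numpy as np

def mutual_info(a, b):
    """Plug-in MI estimate (nats) for two integer-coded vectors."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mrmr(X, y, k):
    """Greedy mRMR: at each step add the feature with maximum
    relevance I(f; y) minus mean redundancy with selected features."""
    d = X.shape[1]
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for f in range(d):
            if f in selected:
                continue
            rel = mutual_info(X[:, f], y)
            red = (np.mean([mutual_info(X[:, f], X[:, s]) for s in selected])
                   if selected else 0.0)
            if rel - red > best_score:
                best, best_score = f, rel - red
        selected.append(best)
    return selected
```

On toy data with a duplicated informative feature, the duplicate is penalised for redundancy and a weaker but non-redundant feature is selected instead.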

__ Correctness__: The method and proofs appear correct, although I didn't check them too thoroughly. The empirical methodology is reasonable.

__ Clarity__: The paper is generally well written. The main paper is self-contained and one can follow the arguments and definitions, lemmas etc.

__ Relation to Prior Work__: See the comment under weaknesses, but generally the paper seems to discuss most of the related prior work.

__ Reproducibility__: No

__ Additional Feedback__: