NeurIPS 2020

Fourier-transform-based attribution priors improve the interpretability and stability of deep learning models for genomics

Review 1

Summary and Contributions: The authors introduce a novel regularization technique, based on the Fourier transform of input gradients, that encourages neural network models to emphasize contiguous segments of the input. The authors developed this technique for predicting transcription factor binding from DNA sequence, with the goal of annotating functional motifs. For such tasks, they demonstrate superior performance and stability with their regularizer.

Strengths: The Fourier regularization technique is clever and may have applicability for other ML problems, or at least inspire further work on training interpretable models. The authors use many different views and orthogonal data to argue for the benefit of their regularizer.

Weaknesses: 1) The most cited methods in this space use a different pooling architecture and train multi-task for dozens of epochs. In contrast, the authors train single-task and select the model obtained after one or two epochs as best. These differences may contribute to the rapid overfitting and saliency variance. Do multi-task models trained for longer improve motif annotation? Does your method also improve motif annotation in that framework? Multi-task datasets can be obtained from both the Basset and DeepSEA papers. Does the regularizer improve training using their model architectures?

2) It is argued that the method is generally applicable. However, it requires binary peak annotations, because the Fourier regularization is applied only to positive examples. Does motif annotation suffer if the regularization is applied to all sequences? If so, and if that is not recommended, then the authors should clarify that binary peaks are required.

3) Why is a smoothing operation added before the Fourier operation? Does performance suffer without it?

4) Several questions about Figure 2. First, how do you divide the Fourier components into low and high frequency? Second, how do you compute the entropy of the nucleotide attributions, which do not represent a probability distribution? Third, in panel C, the zoomed-in range does not fit within the ostensibly larger range in the top browser view.

5) In cases where the motif instance does not match the optimal k-mer, some nucleotide(s) could be mutated to increase the prediction, and you would observe a negative attribution score sandwiched between positive attribution scores. This regularization would not seem as appropriate then. You might look for examples like that and consider adding a supplementary figure to explore and discuss this scenario.

————————————————

I have read the other reviews and the authors' response. To my first point, the authors point out that many of their experiments use multi-task training datasets. Reviewing Section 2.2, I see that is true, but the multiple tasks always represent the same TF. This is quite different from the "massive multi-task" framework employed by the most cited methods in the space. The authors declined to address their use of different architectures. The authors hypothesize that their models overfit rapidly because the datasets are large, for example 700k examples for K562 DNase. I find this reasoning backwards; as the number of sequences in the dataset grows, the influence of each single sequence on the model parameters decreases. Overfitting on the second epoch is strange. However, the authors have considerable experience working with these models and data; if this is normal for them, then hopefully the comparisons are valid and generalizable. To my second point, the authors committed to clarifying this point in the text. To my third and fifth points, the authors clarified their method. They declined to perform an experiment without the smoothing operation, but their response helped me understand the value it brings. To my fourth point, the authors clarified that they normalize the attributions to represent pseudo-probabilities. Based on this discussion, I am willing to increase my score for the paper to 5 and support its acceptance.
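For reference, the entropy computation as clarified in the rebuttal — normalizing attribution magnitudes into a pseudo-probability distribution — can be sketched as follows. This is an illustrative reconstruction; the exact normalization the authors use is an assumption here.

```python
import numpy as np

def attribution_entropy(attributions):
    """Shannon entropy (in bits) of per-position attribution magnitudes.

    The magnitudes are normalized to sum to 1, yielding a
    pseudo-probability distribution over positions. The choice of
    absolute-value normalization is an assumption, not the authors'
    documented scheme.
    """
    p = np.abs(np.asarray(attributions, dtype=float))
    p = p / p.sum()
    p = p[p > 0]  # 0 * log(0) -> 0 by convention
    return -np.sum(p * np.log2(p))
```

Under this scheme, diffuse attributions yield high entropy (uniform attributions over N positions give log2(N) bits) and attributions concentrated on a single motif yield low entropy, which matches how the paper uses entropy as an interpretability metric.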

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The authors propose a novel Fourier-based attribution prior to improve the interpretability of deep learning models of transcription factor binding. The method shows promise relative to the baseline, with improvements almost uniformly across the board in both predictive accuracy on a test set and stability of the derived motifs.

Strengths: 1) The Fourier prior is well motivated, and the resulting interpretability is also compared to an independent method, DeepSHAP, based on Shapley values. 2) The results show the robustness of the predictions and the method's ability to recover known transcription factors such as CLOCK. 3) The authors analyze multiple facets of their methodology on the same 4 datasets, demonstrating an improvement on almost all of them.

Weaknesses: 1) It is unclear how much better these methods perform than "old-school" motif prediction methods. 2) The only comparison is between models with the prior and models without it; no other methods are compared against. 3) The authors do not seem to have a good explanation for the situations in which their method performs less well than the baseline.

Correctness: The methodology appears to be correct to the extent that I am able to ascertain it.

Clarity: The paper is clearly written and coherently argued.

Relation to Prior Work: Only similar deep learning models are discussed; other interpretable, but not "deep", models are not discussed at all.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The authors propose a novel attribution prior for training deep learning models in genomics. At training time, they obtain attribution scores through input gradients and regularize the model by penalizing the high-frequency components of the attributions' Fourier spectrum. Through extensive empirical experiments, the authors show that the proposed attribution prior dramatically improves models' stability, interpretability, and generalization performance.
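The core idea described here — smooth the per-position attributions, take their Fourier transform, and penalize the high-frequency mass — can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation: the Gaussian smoothing kernel, its width, the magnitude normalization, and the cutoff rule are all assumptions.

```python
import numpy as np

def fourier_prior_loss(attributions, motif_len=7, smooth_sigma=1):
    """Sketch of a Fourier-based attribution prior loss.

    attributions: per-position attribution scores, shape (L,).
    Returns the fraction of Fourier magnitude above a cutoff
    frequency tied to an assumed minimum motif length.
    """
    attributions = np.asarray(attributions, dtype=float)
    L = len(attributions)

    # Smooth the absolute attributions before the transform (the paper
    # applies a smoothing step; a Gaussian kernel is one plausible choice).
    radius = 3 * smooth_sigma
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (xs / smooth_sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(np.abs(attributions), kernel, mode="same")

    # Magnitudes of the real-input Fourier components, normalized to sum to 1.
    mags = np.abs(np.fft.rfft(smoothed))
    mags = mags / (mags.sum() + 1e-12)

    # Components whose period is shorter than motif_len count as
    # "high frequency"; their total normalized magnitude is the penalty.
    cutoff = L // motif_len
    return mags[cutoff:].sum()
```

A contiguous, motif-like block of attribution concentrates its spectrum at low frequencies and incurs a small penalty, while position-to-position noise spreads magnitude into the penalized high-frequency band. In the actual method this quantity is added to the training loss, so the attribution function must be differentiable (input gradients are, which is why they are used at training time).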

Strengths: (1) Novelty and significance: The proposed Fourier-based attribution prior is novel and effective. It stems from the intuitive motivation that attribution scores should be concentrated on low-frequency, plausible motifs. To the best of my knowledge, this is the first work to propose Fourier-based priors, and also one of the first to use attribution priors to regularize deep learning models in genomics. The extensive empirical experiments are impressive and show that the method would benefit the wider NeurIPS and bioinformatics communities. (2) Reproducibility: While the main manuscript focuses on presenting the quantitative and qualitative experimental results, the authors also provide detailed information and code to reproduce the experiments in the supplementary materials. I highly appreciate the authors' effort in providing everything needed to reproduce the results.

Weaknesses: I did not find any major limitations of the work, but I would like to see some more experiments and the authors' opinions on the following issues. (1) There is a discrepancy between the attribution methods used during training and evaluation. Considering that different attribution methods often produce different scores, why would penalizing input gradients improve the SHAP scores? Does the proposed method also improve the interpretability results in terms of the input gradients themselves? How much would different attribution methods (used during either training or evaluation) affect the proposed method? (2) Although there are previous works that use attribution priors to regularize deep learning models, the manuscript does not cover them. Please include these related works in the manuscript. (3) While the authors qualitatively show that the attribution prior improves the detection of motifs, can you provide quantitative results against ground-truth motifs? I am not sure, but the TomTom tool might help compare the obtained motifs with ground-truth motifs.

Correctness: The claims and methods seem correct, but I would like to recommend more ablation studies on the encoding scheme.

Clarity: The paper is generally well written and easy to understand. The contribution is clear, the claims are easy to follow, and the notation is explained well.

Relation to Prior Work: No. I think the authors should properly cite and explain related works that use attribution priors to regularize deep learning models.

Reproducibility: Yes

Additional Feedback: No additional comments. ---- Post Author Feedback Comments --- I have read the Author Feedback and I'm happy that the authors have clarified some points.

Review 4

Summary and Contributions: The stated contribution of the paper is an attribution prior that penalizes high-frequency components of the Fourier spectrum of per-position attribution scores, which the authors claim improves the interpretability and stability of these models. The paper is motivated by challenges encountered in applying deep learning models to nucleotide sequences to predict functional genomics measurements (i.e., transcription factor binding and chromatin accessibility). These models have improved prediction performance but are difficult to interpret, leading to attribution scores as one approach to interrogate the model and inputs for sequence motifs (and coordination between motifs) that drive individual predictions. However, attribution scores can be noisy and sensitive to random initializations of the same model. The manuscript empirically demonstrates that adding the per-position attribution prior to the training of models of transcription factor binding (binary outcomes or TF profiles) improves not only the detection of motifs within individual models but also the stability of motif detection across random initializations of the model.

Strengths: The authors nicely demonstrate improvements in the detection and stability of motifs using their attribution prior. Quantifying improvements in attribution can be challenging, but the authors provide several metrics to this point along with compelling visual examples. The empirical analysis is thorough. Using the frequency spectrum of the attributions as a proxy for localization/noisiness is a nice contribution, bringing into this area a perspective from other fields that treat high-frequency signal as noise. The manuscript also demonstrates that the attribution prior does not decrease performance on the original predictions of interest.

Weaknesses:
1. The paper did not compare to the attribution priors of related works (citations [9] and [10]); it only added L2 regularization.
   * In particular, [9] describes an image-classification prior that penalizes total variation between adjacent pixels. This implicitly penalizes high-frequency modes of the attribution vector (an image in that case) and is closely related to total variation image denoising (versus the manuscript's method's relationship to low-pass filters). A comparison to some of the priors in these previous works (adapted to 1D sequences) would have been appropriate.
   * The impact of other forms of regularization (dropout, L1, etc.) would also have been nice to see.
2. It would have been helpful to clarify directly in the description of the methods if/when other forms of regularization (L2, L1, dropout) were being used, and/or specifically whether no regularization was used to prevent overfitting by default (the "No prior" model in Tables 1-4). Since the paper claims in Tables 1-4 that, in addition to improved interpretability, the Fourier transform attribution prior offers a performance improvement, it becomes important to specify whether the model with the Fourier transform prior has better/similar/worse results compared to a default model with some commonly used regularization (L2, L1, dropout). The authors should show that performance does not degrade when their prior is compared to a default model with these other regularizations.
3. The supplemental code was missing configuration files or driver scripts indicating which parameters/models were used for which figures. The train-model scripts had default parameters specified via the Sacred library/framework, and they and several of the notebooks depended on hardcoded paths and files that do not exist in the code (or as targets for downloading by other scripts, at least as far as we could assess).
4. Details about what was being computed and when were sometimes sparse. For example, the paper quantifies differences in the Shannon entropy of the attributions. How is this computed (i.e., what probability distribution is being used/inferred)?
5. The attribution method g must be differentiable almost everywhere in order to permit optimization of the attribution loss by gradient descent. This limitation should be clearly stated, although all evaluations in the paper used an input-feature-scaled gradient for g.
6. The manuscript sets the frequency threshold for penalization in relation to a motif length of 7. This is specific to the TF prediction task. Other predictions from nucleotide sequences (e.g., RNA-binding protein predictions) may involve sequence motifs that are traditionally considered more degenerate and short. We suspect that the approach might have difficulty penalizing high-frequency components without impacting predictions, and we suggest that the potential limitations for detecting real but short regulatory regions be discussed when considering its broader impact.
7. Tables 3 and 4 show worse auPRC performance for Nanog/Oct4/Sox2.
   * It is unclear why the better performance of the "no prior" model was not bolded.
   * The setting of "more complex motif sequence syntax" is a reason DL in genomics is particularly interesting, and it was disappointing to see lower performance in this setting. While the manuscript explains that other motifs not underlying a peak were also being highlighted, referring to the example in S20, further quantification of this phenomenon and additional discussion of how it could cause lower performance in this evaluation is warranted.

======

We have read the authors' response and the other reviews. We thank the authors for their detailed response, especially with respect to our questions/comments about other choices of regularization. We felt that the original manuscript implied that the Fourier regularization penalty improved interpretability without decreasing predictive performance. However, the response indicates that the penalty decreases predictive performance compared to traditional regularization. The authors should explicitly note the worse predictive performance in these cases to ensure they do not overstate their results. We also take issue with the reasoning for leaving out a comparison between traditionally regularized models with/without the additional Fourier regularization penalty; combining several competing penalties is very common in the literature. Any difficulty in combining this penalty with other forms of regularization is itself an observation that should be noted. Ultimately, these issues do not affect our overall score, because the results on interpretability are the focus and remain compelling, but we hope the authors are more forthcoming about their method's limitations in the final manuscript.

Correctness: We did not identify any errors and were satisfied with their empirical approach.

Clarity: Yes, the paper is well written.

Relation to Prior Work: While the manuscript cites two papers ([9], [10]) describing attribution priors, they are neither discussed nor used as a standard for comparison. In particular, [9] describes a total variation attribution prior on adjacent pixels (i.e., positions) that would implicitly penalize high-frequency components of the attribution vector's Fourier spectrum.

Reproducibility: No

Additional Feedback:
1. The work was reasonably reproducible, but there were issues with the supplemental code, discussed in the weaknesses section.
2. "Dramatically improves" in the abstract is uninformative, if not also overstated. In general, while the authors show consistent improvements on various metrics, in many cases these appear small and/or not statistically significant, so toning down the associated claims/statements seems warranted.
3. The implicit periodicity of the Fourier transform over the length of the input window could lead to odd artifacts in the penalty when attribution signal is present at both ends of the attribution vector (which the Fourier transform treats as adjacent).
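The periodicity concern in point 3 is easy to demonstrate: the DFT magnitude spectrum is invariant to circular shifts, so attribution signal split across the two ends of the window is spectrally indistinguishable from one contiguous block in the middle. A minimal NumPy illustration (the window size and bump placement are arbitrary):

```python
import numpy as np

# A contiguous attribution "bump" in the middle of the window...
L = 256
bump = np.zeros(L)
bump[120:136] = 1.0

# ...and the same bump circularly shifted so it wraps around,
# with half of it at each end of the window.
wrapped = np.roll(bump, 128)

# The DFT magnitude spectrum cannot tell them apart: the transform
# treats positions L-1 and 0 as adjacent, so a low-pass penalty would
# score the split signal as if it were one contiguous feature.
assert np.allclose(np.abs(np.fft.rfft(bump)),
                   np.abs(np.fft.rfft(wrapped)))
```

In other words, a frequency-domain penalty by itself cannot distinguish boundary-straddling signal from a single interior motif, which is exactly the kind of artifact the reviewer flags.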