NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5125
Title: Evaluating Protein Transfer Learning with TAPE

Reviewer 1

Comments:
- The paper is incredibly well written and very concise (one of the clearest papers I have reviewed).
- In Table 2, do the results corresponding to [11] and [12] involve pre-training? If not, that would make them the best architectures without pre-training. I wonder if pre-training would further improve their performance.
- It would be nice to have a slightly more detailed explanation of the alignment-based features used in this work.
- Does pre-training help if you use alignment-based features for the secondary structure and contact prediction problems? Right now, it is unclear which rows involve models with pre-training and which rows do not.
- A further investigation of why pre-training reduces the performance for the contact prediction problem would be interesting.

Reviewer 2

The manuscript presents a set of diverse protein prediction tasks, with the purpose of establishing a benchmark for testing representation/transfer learning on protein sequence data. In addition, it establishes a strong baseline for the field by implementing a range of different standard sequence models and demonstrating their performance on the benchmark set. I expect both the benchmark set and the results reported in this paper to have a substantial impact on the community. Below are some comments and suggestions for changes.

Page 3: Since the goal is to "ensure that no test proteins are closely related to train proteins", it would be informative if the authors could state the expected (or maximum) sequence identity between PFAM families. Wouldn't it have made sense to do the split at the clan level, to reduce the chance of information leakage between families within the same superfamily?

About task 2: Much of the recent progress in protein structure prediction comes from prediction of distance distributions rather than a simple binary classification of contact presence. You could perhaps consider modifying task 2 to this more challenging problem, as it is closer to a real-world application. I realize that this is not feasible within the time frame of a NeurIPS rebuttal, so this is merely a suggestion for future development of the benchmark.

About task 4: Is the Hamming distance by which you measure mutation degree computed at the nucleotide or the amino acid level? As far as I can see, both would have subtle problems. If it is done at the nucleotide level, then the exact same amino acid sequence might end up in both the train and test sets (due to synonymous mutations). If it is measured at the amino acid level, then not all mutations at the same Hamming distance would be equally distant. The authors should clarify this.
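The synonymous-mutation concern about task 4 can be illustrated with a small sketch: two DNA sequences at nucleotide Hamming distance 1 can translate to the exact same protein, so a nucleotide-level split could place identical amino acid sequences in both train and test sets. The tiny codon table below is purely illustrative, not the full genetic code.

```python
# Illustrative two-amino-acid codon table (not the full genetic code).
CODON_TABLE = {
    "CTT": "L", "CTC": "L",  # synonymous codons for leucine
    "GCT": "A", "GCA": "A",  # synonymous codons for alanine
}

def translate(dna: str) -> str:
    """Translate a DNA string into an amino acid string, codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

seq_train = "CTTGCT"  # translates to "LA"
seq_test = "CTCGCT"   # one nucleotide away, also translates to "LA"

assert hamming(seq_train, seq_test) == 1             # distinct at the DNA level
assert translate(seq_train) == translate(seq_test)   # identical protein
```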
Page 6, line 208: The authors write "(Assuming a Markov sequence)"; however, it is not clear to me that such an assumption is actually made, since the model is conditioned on all preceding amino acids.

Perhaps I missed it, but it was not entirely clear to me which representation was used as the basis for the different downstream tasks. For instance, for the LSTM, do the authors simply use the 2x1024 hidden unit state at each position as the feature representation? Doesn't this mean that the dimension of the representation differs between methods? Have the authors investigated whether the difference in representation size alone has an effect on downstream performance (for instance, one would imagine that the downstream model would need a larger number of parameters if the input dimension were larger)? Furthermore, for the tasks where a single prediction is made for an entire protein, how are the individual representations combined? Is this done by averaging over the individual representations, or by some more complex scheme (such as the averaging + last-state used by UniRep for some of its applications)?

Quality: The work is systematic and carefully executed. Apart from the questions above, I have full confidence that the submission is technically sound.

Clarity: The manuscript is well structured and well written, and emphasis has been placed on making it accessible to a broader ML community. I have a few minor comments that could improve the clarity even further:
1. Figure 1 (a): The blue and yellow strand cartoons next to the input and output are confusing. The yellow one should be removed. Furthermore, the E->C transition is not clearly visible in the chosen cartoon representation.
2. Figure 1 (b): There seem to be some contacts missing between the first and last strands in the sequence. There is a single point, but I assume there should be an entire strand. This is certainly a detail, but for the sake of clarity it would be good to get this right.
3. Page 4, line 143: Perhaps add (H), (E), and (C) labels after the full names in {Helix, Strand, Other}, just to make the coupling to Fig 1a clearer.
4. There seems to be a word missing in line 262: "tasks with more for significant improvement".

Originality: The submitted work is original, and clearly cites earlier work.

Significance: There is currently a lot of activity in semi-supervised and unsupervised learning of biological sequences. It has, however, been difficult to assess the extent of the progress that these approaches constitute over earlier methods. The current work represents a real step forward in this direction, by establishing a solid benchmark and by clearly demonstrating the areas in which the learned representations still underperform classic bioinformatics approaches. I expect it to have a substantial impact on the field.

Updated after rebuttal: The authors have addressed my concerns. I have therefore updated my score to 8.
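The two pooling schemes asked about above (plain averaging versus the UniRep-style averaging + last hidden state) can be sketched as follows; the shapes are hypothetical (a length-L protein with d-dimensional per-residue features), not taken from the paper.

```python
import numpy as np

def mean_pool(h: np.ndarray) -> np.ndarray:
    """Average per-residue representations into one protein-level vector."""
    return h.mean(axis=0)  # (L, d) -> (d,)

def mean_plus_last_pool(h: np.ndarray) -> np.ndarray:
    """Concatenate the mean with the final position's hidden state."""
    return np.concatenate([h.mean(axis=0), h[-1]])  # (L, d) -> (2d,)

# Hypothetical example: a length-120 protein with d = 1024 features per residue.
h = np.random.randn(120, 1024)
assert mean_pool(h).shape == (1024,)
assert mean_plus_last_pool(h).shape == (2048,)
```

Note that the second scheme doubles the input dimension seen by the downstream model, which ties into the question about whether representation size alone affects downstream performance.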

Reviewer 3

This manuscript assembles and describes a set of benchmark data sets for use by the machine learning community in evaluating semi-supervised learning in relation to various properties of proteins. The authors also demonstrate the application of several self-supervised pretraining methods to these tasks.

The manuscript is extremely clearly written, which is critical in work for which one of the primary goals is communicating to non-specialists in computational biology. The benchmarks are well constructed, though the actual work involved in curating them was not particularly substantial, since most of the benchmarks have been previously published and are merely collated in this work into one collection. Still, the utility is clear, since practitioners can now go to one place and test their methods using a single interface on a variety of tasks.

In terms of modeling, the approaches they use represent architectures that frequently yield state-of-the-art performance in NLP or computer vision, such as transformers or ResNets. One concern is that the models employed here are huge (on the order of 38M parameters). The manuscript should provide some justification for using such large models. The concern here is two-fold: (1) these complex models may not be as powerful as simpler models, especially on some of the smaller benchmarks, and (2) beginning with massive models that require significant compute resources may discourage potential users of the benchmarks.

A significant problem with all of the benchmarks is the simplicity of the evaluation, where only global accuracy metrics are considered. Methods like precision-recall and ROC analysis are frequently employed in this domain. It might also be helpful to explore metrics such as accuracy at predicting long-range contacts, or error partitioned by the true label; note that the latter would mean evaluating accuracy for low-stability and high-stability proteins separately.
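A minimal sketch of the per-class evaluation suggested above, computing precision and recall for one class of a binary task (e.g. contact vs. no contact) in plain Python; the labels and predictions are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: tp = 2, fp = 1, fn = 1, so precision = recall = 2/3,
# even though global accuracy (4/6) would hide the class-level errors.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
p, r = precision_recall(y_true, y_pred)
assert abs(p - 2 / 3) < 1e-9 and abs(r - 2 / 3) < 1e-9
```

Sweeping a score threshold and recording these two quantities at each step yields the full precision-recall curve.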
line 150: The CB513 data set is quite standard in secondary structure prediction, but it is also quite old, dating from 1999. A newer, larger benchmark should be employed.

Minor comments:
- line 1: "Protein modeling" is vague. Clarify what task is being addressed here by including information from lines 37-38.
- line 75: The term "homology" is reserved only for proteins that share a common evolutionary ancestor. If the proteins share similar function but no common ancestor, they are not homologous.
- line 125: The text says that the test set is constructed by holding out families, but the next sentence reads "For the remaining data we construct training and test sets using a random 95%/5% split." This makes no sense.
- line 134: "8 thousand" -> "8000"