Review for NeurIPS paper: VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

NeurIPS 2020

VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain

Review 1

Summary and Contributions: This work proposes a self-supervised framework for representation learning with tabular data. The authors propose a multi-headed self-supervised training model that first corrupts (augments) the input tabular data using a binary mask, and then one head reconstructs the mask while the other head reconstructs the uncorrupted data. In addition, the authors use a standard supervised loss function for data that contain labels, an addition that makes this model applicable to semi-supervised learning. The authors demonstrate the effectiveness of their multi-headed reconstruction pretext task on a genomics dataset, patient treatment dataset, and two tabular benchmark datasets (UCI Income & Blog) as well as MNIST treated as tabular data.
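As a rough illustration of the corruption step summarized above, the following NumPy sketch draws a binary mask and replaces the masked entries with values taken from a column-wise shuffle of the batch, so that each replacement follows that feature's empirical marginal distribution. The function name `corrupt`, the parameter `p_mask`, and the loop-based shuffle are assumptions made for illustration; they are not taken from the paper's code.

```python
import numpy as np

def corrupt(x, p_mask, rng):
    """Return (mask, x_tilde) for a batch x of shape (n, d).

    Hypothetical helper: mask[i, j] == 1 marks an entry that is replaced.
    """
    n, d = x.shape
    # Bernoulli mask, one independent draw per table entry.
    mask = rng.binomial(1, p_mask, size=(n, d))
    # Shuffle every column independently so that replacement values
    # come from the same feature, but from a random other row.
    x_bar = np.empty_like(x)
    for j in range(d):
        x_bar[:, j] = x[rng.permutation(n), j]
    # Keep original entries where mask == 0, swap in shuffled ones where mask == 1.
    x_tilde = mask * x_bar + (1 - mask) * x
    return mask, x_tilde

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
mask, x_tilde = corrupt(x, p_mask=0.3, rng=rng)
```

In a two-headed setup of the kind described, one head would be trained to predict `mask` from `x_tilde` and the other to predict `x` from `x_tilde`.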

Strengths: Overall, machine learning on tabular data is an understudied problem, and this paper lays out a clear and justifiable explanation for the development of its self-supervised pretraining approach for tabular data. The paper proposes a novel two-part reconstruction task for masked tabular data, in which reconstructing the mask itself and reconstructing the unmasked input data are the two feedback mechanisms for the self-supervised learning. The paper studies a unique set of genomics and patient treatment datasets that tie in nicely with the original motivation of the paper. The experimental results look promising, and the authors include a few ablations to better understand the benefit of the semi-supervised learning component. The applicability to tabular data and the empirical evaluations are the primary strengths of this work.

Weaknesses: My central concern for this paper is the misalignment between the motivation and methodology. As motivation, the authors argue that self-supervised CV and **NLP** “algorithms are not effective for tabular data.” The proposed model, though, is effectively the binary masked language model whose variants pervade self-supervised NLP research (e.g. WordNet, BERT, etc). Granted, instead of masking words, the proposed models mask tabular values, but this is a very similar pretext task. In fact, there is concurrent work that learns tabular representations using a BERT model [1]. At the very least, I think it is worth discussing how this masked entry model is similar to a masked language model. I believe this paper also overlooks [2] as related research. Lines 167-168: The justification for using the two-component pretext task is that it is more difficult than either individual task. Did you explore using only one of the two components? Line 195: Is it true that the correlation structure is less obvious in tabular data than in images or text? The semi-supervised learning aspect of this paper described in Section 4.2 (using a weighted combination of an unsupervised loss function and a supervised loss function) is well established, e.g. [3], and I think this paper could focus more on the novelty of the pretext tasks for tabular data. It would be interesting to experiment with alternative corruption (augmentation) models and measure their impact on different kinds of tabular data.
[1] TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. https://arxiv.org/abs/2005.08314
[2] TabNet. https://arxiv.org/abs/1908.07442
[3] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, 2005.
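To make the "weighted combination of an unsupervised loss function and a supervised loss function" concrete, here is a minimal NumPy sketch in the spirit of the entropy-minimization approach cited as [3]: cross-entropy on labeled examples plus a weighted prediction-entropy term on unlabeled examples. This is not the loss used in the paper under review; the names `semi_supervised_loss` and `beta` are hypothetical, and the classifier producing the logits is assumed to exist elsewhere.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def semi_supervised_loss(logits_lab, y_lab, logits_unlab, beta=1.0, eps=1e-12):
    """Supervised cross-entropy + beta * mean entropy on unlabeled predictions."""
    p_lab = softmax(logits_lab)
    # Cross-entropy of the true class on the labeled batch.
    supervised = -np.mean(np.log(p_lab[np.arange(len(y_lab)), y_lab] + eps))
    p_un = softmax(logits_unlab)
    # Shannon entropy of each unlabeled prediction, averaged over the batch.
    unsupervised = -np.mean(np.sum(p_un * np.log(p_un + eps), axis=1))
    return supervised + beta * unsupervised

# Two labeled and three unlabeled examples, two classes, uninformative logits.
loss = semi_supervised_loss(np.zeros((2, 2)), np.array([0, 1]), np.zeros((3, 2)), beta=0.5)
```

Setting `beta` to zero recovers purely supervised training, which is one way to see why the reviewer regards this part of the framework as standard.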

Correctness: The methods and empirical methodology appear correct.

Clarity: Overall the paper is clearly written. I believe Figure 1 and Figure 2 could be easily combined. Typos: Line 64: “multivie”; Line 225: missing space after sentence.

Relation to Prior Work: (discussed in "Weaknesses")

Reproducibility: Yes

Additional Feedback: UPDATE: After reading all reviews and the authors' response, I find that they addressed my primary criticisms of (i) leaving out related works, (ii) overclaiming on their novelty, and (iii) clarity issues. Some of these concerns were also shared by other reviewers. With the proposed updates, I think this paper will be a worthwhile contribution.

Review 2

Summary and Contributions: This manuscript contributes self- and semi-supervised approaches well suited to tabular data. The point is that tabular data does not come with obvious invariants and corresponding transformations that can be used to create self-supervision. The contributed method relies on creating representations that facilitate learning.

Strengths: The work contributes a new reconstruction loss for unsupervised training of representations. This loss extends auto-encoder practice with a pretext task that uses the marginal distribution of features. It can then be used to help train intermediate representations in a semi-supervised setting, to improve prediction, adapting existing frameworks. The manuscript contributes empirical benchmarks on a genomic dataset as well as clinical data and a few UCI tabular datasets, demonstrating some increase in performance.

Weaknesses: The manuscript tackles tabular data; however, it avoids the problem of categorical entries, which are frequent in such data. In particular, the squared loss is used (eq 6), which is not very relevant for categorical data. Likewise, in the experimental validation, the datasets used do not seem to include categorical features, although the UK Biobank does have categorical features, beyond genomics. As a baseline for the genomics experiments, it would have been interesting to use a PCA to learn representations. In genomics, such a simple model often performs well. With regard to the encoder baseline: were the data centered and normed before fitting an auto-encoder? Indeed, in the absence of standardization, the reconstruction loss is brittle.
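The two baseline suggestions above (standardizing features before fitting a reconstruction model, and using PCA to learn representations) can be sketched in a few lines of NumPy. This is only an assumed setup for illustration: the helper names `standardize` and `pca_components`, the synthetic data, and the choice of two components are hypothetical and do not come from the paper.

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Center and scale each feature using statistics from the training split only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / (sigma + eps), (test - mu) / (sigma + eps)

def pca_components(x_centered, k):
    """Return the top-k principal directions as a (d, k) matrix via SVD."""
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return vt[:k].T

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
test = rng.normal(loc=5.0, scale=3.0, size=(10, 4))

train_std, test_std = standardize(train, test)
w = pca_components(train_std, k=2)
# Low-dimensional representations that a downstream predictor could consume.
z_train = train_std @ w
z_test = test_std @ w
```

Without the standardization step, a squared reconstruction loss is dominated by whichever features have the largest scale, which is the brittleness the reviewer refers to.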

Correctness: It seems methodologically correct, but the baselines need to be run well, in particular with standardization of the data.

Clarity: Overall, the paper is well written. I must confess that it took me a while to understand that the "supervised only" line of Table 2 was the same thing as the "2-layer perceptron".

Relation to Prior Work: The relation to other works is well discussed.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper extends self- and semi-supervised learning to the tabular domain. VIME is proposed to estimate the mask as a pretext task. Experiments on related datasets show its superiority. After reading the response, I find that the authors have resolved part of my concerns. I revise the overall score to 6.

Strengths: (1) The extension to the tabular domain with mask estimation is interesting and useful. (2) The authors conduct extensive experiments. Experimental results look good compared with Mix-up.

Weaknesses: (1) I think the novelty is limited. As introduced in the paper, self/semi-supervised learning has already been thoroughly investigated in other domains, including image and language. Tabular domain aside, feature vector estimation is common in auto-encoders, and the novelty of the proposed mask estimation is not sufficient. I think the existing Gaussian noise based augmentation and estimation is very similar, except for the difference in distribution. (2) The motivation for generating the masked samples by Eq. (3) is unclear. Why do you add the first term, especially after the shuffle operation? (3) I think the mask m does not need to be binary. Have you tried other distributions, such as the Gaussian distribution? You only claim that your approach is more difficult. I would like to see a more detailed analysis as well as an experimental comparison. (4) For the results in Table 2, I wonder how you compute the accuracy for the method 'Self-SL only'. I'm afraid that there is no classification module in the self-supervised learning framework.
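For reference, the Gaussian-noise augmentation that points (1) and (3) compare against could look like the NumPy sketch below, in the style of a denoising auto-encoder input corruption. The function name `gaussian_corrupt`, the parameter `sigma`, and the per-feature scaling are assumptions chosen for illustration; the paper's Eq. (3) is not reproduced here.

```python
import numpy as np

def gaussian_corrupt(x, sigma, rng):
    """Add zero-mean Gaussian noise to every entry of a batch x of shape (n, d).

    Hypothetical helper: noise is scaled by each feature's standard deviation
    so that `sigma` has the same meaning for all columns.
    """
    scale = x.std(axis=0, keepdims=True)
    noise = rng.normal(size=x.shape) * sigma * scale
    return x + noise

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))
x_noisy = gaussian_corrupt(x, sigma=0.1, rng=rng)
```

Under this scheme every entry is perturbed by a small continuous amount, so there is no discrete set of corrupted positions for a model to identify, and the perturbed values need not be values that the feature ever takes in the data.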

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Please address my concerns in the Weaknesses section.