NeurIPS 2020

Field-wise Learning for Multi-field Categorical Data

Review 1

Summary and Contributions: The proposed method employs the natural structure of data to learn simple and efficient models. The models can be fitted to each category and can better capture the underlying differences in data.

Strengths: 1. The writing of this paper is clear and easy to understand. 2. The proposed method outperforms other methods on two important evaluation metrics.

Weaknesses: 1. Some recent related methods are missing from the experiments. 2. Compared with IPNN and OPNN, the number of parameters is much larger, but the performance improves only slightly. The improvement of the method is not impressive. 3. The caption of Figure 1 is missing. 4. I would like to know more about the datasets and how they differ from traditional datasets.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper proposes field-wise learning for dealing with multi-field categorical data. Specifically, the proposed method is based on linear models with variance and low-rank constraints. It also leverages the structure of the data to learn one-to-one field-focused models. The paper provides a generalization error bound that theoretically supports the proposed constraints, as well as some explanation of the influence of over-parameterization. Experiments are conducted on the Criteo and Avazu datasets.

Strengths: + The problem studied in this paper is interesting and practical. + The paper is clearly written.

Weaknesses: - The proposed method lacks technical contributions. The main components of field-wise learning are simple and straightforward, and are already popular in other related machine learning / data mining tasks. - The generalization error bound seems tight, which cannot provide a theoretical basis. - In the experiments, the datasets used are Criteo and Avazu. More real-world datasets should be considered as important test beds, e.g., iPinYou. - The compared methods are outdated. The most recent one was published in 2017, so the comparison cannot validate the effectiveness of the proposed method. Recent state-of-the-art methods published at KDD and NeurIPS should be compared and discussed.

Correctness: The method and claims are good.

Clarity: This paper is easy to follow and clearly written.

Relation to Prior Work: The paper lacks discussion of, and empirical comparison with, recent state-of-the-art methods. The authors are encouraged to discuss them.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This article presents a new method for multi-field learning, i.e. learning with nominal variables. The proposed method learns a model for every field, and every value of a specific field. A low-rank regularization mechanism is applied such that the models for different values of the same field look similar. The models of different fields are aggregated to obtain the final prediction. Linear models are used for the different fields and all models are jointly optimized in a single optimization problem. The authors present generalization bounds and experimental results on two datasets.
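To make the summarized structure concrete, here is a minimal sketch of one way such a model could look. It is based only on the description above, not on the authors' code: the factorization W = U V^T, the summation used as the aggregation function, the sigmoid output, and all function names are assumptions made for illustration.

```python
import numpy as np

def one_hot_concat(values, cards):
    """Concatenate one one-hot vector per field into a single input vector."""
    x = np.zeros(sum(cards))
    offset = 0
    for v, c in zip(values, cards):
        x[offset + v] = 1.0
        offset += c
    return x

def init_params(cards, rank, rng):
    """One (U, V) pair per field; U @ V[k] is the linear model for value k.

    Assumption: low rank is imposed by factorizing each field's stacked
    weight matrix (d x c_i) as U (d x rank) times V^T (rank x c_i).
    """
    d = sum(cards)
    return [(0.01 * rng.standard_normal((d, rank)),
             0.01 * rng.standard_normal((c, rank))) for c in cards]

def predict(values, cards, params):
    """Sum the scores of the models selected by each field's value."""
    x = one_hot_concat(values, cards)
    score = 0.0
    for v, (U, V) in zip(values, params):
        w = U @ V[v]   # weight vector of the model picked by this field's value
        score += w @ x
    # Assumed aggregation: plain sum followed by a sigmoid.
    return 1.0 / (1.0 + np.exp(-score))

rng = np.random.default_rng(0)
cards = [2, 5, 3]                      # hypothetical field cardinalities
params = init_params(cards, rank=2, rng=rng)
p = predict([1, 4, 0], cards, params)  # one sample: a value index per field
```

Because all models of a field share the same U, the models for different values of that field are tied together, which is one reading of "look similar" in the summary.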

Strengths: Overall this is an interesting method that is substantially novel. The paper is also well written. I would recommend to accept this paper, but I do have some suggestions for improvement.

Weaknesses: - some more in-depth discussion of differences with existing methods would have been useful (see below) - not very clear why the method outperforms other methods

Correctness: It is nice that the proposed method outperforms existing methods, but it is not very clear to me what the main reason is. Does the improvement come from the specific model structure, which regularizes individual models for specific values of a specific field, or does the improvement come from the ensembling effect created by the aggregation function F? If the latter is the case, then it is quite normal that the new method outperforms for example logistic regression, which fits a single model to the data.

Clarity: Yes, very well written.

Relation to Prior Work: One could argue that the method is an extension of multi-task learning methods that adopt low-rank regularization. Multi-task learning methods are often applied to one important specific field, e.g. location. The method of the authors extends this idea to multiple fields. I would have liked to read a bit more about that in this paper. Similarly, there is also a connection with factor models and mixed linear models in statistics, which are developed specifically for multi-field data. From that perspective, I found the notion of multi-field data a bit confusing. These are simply called factors in statistics, so why invent a new name for something that has been thoroughly discussed in the statistics literature?

Reproducibility: Yes

Additional Feedback: In optimization problem (7) the models for the different fields all have the same rank r. Is that not a serious limitation? I can imagine that some fields only take a few values (e.g. gender) whereas others have a lot of values (e.g. location). In such situations one should use different values of r. The proposed method also only considers categorical features. Most datasets, however, have a mix of categorical and continuous features. How could the method be extended to handle both categorical and continuous features? Related to this last remark, is building a model for every individual field really the way to go when the number of fields is large?
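The point about a shared rank can be illustrated with a small parameter count. The sketch below assumes that each field's stacked weight matrix has shape d x c_i (d is the total one-hot dimension, c_i the field's cardinality) and is factorized into two rank-r_i factors; both this factorization and the helper names are hypothetical. Since the rank of a d x c_i matrix cannot exceed c_i, capping r_i at c_i loses no expressiveness under this assumption.

```python
def per_field_ranks(cards, max_rank):
    """Cap each field's rank at its cardinality (rank of a d x c matrix is at most c)."""
    return [min(max_rank, c) for c in cards]

def param_count(cards, ranks):
    """Parameters of the assumed factorization: (d + c_i) * r_i per field."""
    d = sum(cards)
    return sum((d + c) * r for c, r in zip(cards, ranks))

cards = [2, 1000]                        # e.g. a binary field and a large one
shared = param_count(cards, [8, 8])      # same rank for every field
capped = param_count(cards, per_field_ranks(cards, 8))
print(shared, capped)                    # 24048 18024
```

In this toy setting the binary field accounts for the whole difference, which is the situation the question above describes.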

Review 4

Summary and Contributions: The authors present an approach for modelling categorical variables. Each categorical column in a table is termed a ‘field’ by the authors. The main idea appears to be based on splitting the regularisation term for each ‘field’. The authors present a thorough derivation of their method. A linear and a nonlinear model are developed. These contributions are combined with a strong experimental section that shows the strength of the proposed approach in comparison with other approaches. I enjoyed reading this paper and think it could be a relevant contribution to the field. Admittedly, when first reading it, I thought the methodological contributions were not overwhelming: there exist many similar approaches inspired by Canonical Correlation Analysis / Multi-view learning / Collective Matrix Factorization / Dictionary Learning / Imputation, see e.g. [1,2,3]. Most of these papers treat (or at least could treat) the problem of dealing with a heterogeneous mix of continuous and different categorical variables in a similar manner, without writing a separate paper about how they model these categorical variables. But I feel this topic is important and actually does deserve more attention rather than being glossed over en passant when writing about a novel matrix factorisation or neural network method. How to model categorical variables is not a mere preprocessing choice, especially when they have high cardinality and are dirty (like strings with typos). As the manuscript has a great experimental section, I’m leaning towards accept here. [1] Singh et al., Relational Learning via Collective Matrix Factorization. [2] Li et al., A Survey of Multi-View Representation Learning. [3] Wu et al., Multi-view low-rank dictionary learning for image classification.

Strengths: The strength of the paper is a well thought through modelling approach for heterogeneous sets of categorical variables. Another plus is the solid experimental section. I particularly liked how the authors include the number of parameters in one of the results tables.

Weaknesses: I guess the main limitation is novelty: the approach is very similar to other forms of multi-view learning, as mentioned in the summary section.

Correctness: The approach looks sensible and the derivation appears sound. The experimental validation is solid.

Clarity: yes

Relation to Prior Work: The authors compare their work to a large number of related methods and a number of competitor methods are discussed. The connection to other multi-view methods (which are different in that they don’t model categorical variables only, but are still similar in their regularization) could be discussed.

Reproducibility: Yes

Additional Feedback: As mentioned, the relation to other similar methods could be discussed better, but I think the experimental comparisons are great as they are. It would also be interesting to see how the proposed method works with more complex network architectures, as alluded to in the conclusion. For reducing the number of parameters, maybe a sparsity-inducing norm rather than an L_2 norm on W would work, as in [4]? [4] Mairal et al., Online Dictionary Learning for Sparse Coding.
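As a rough illustration of why a sparsity-inducing norm could reduce the number of effective parameters, the sketch below compares the proximal step of an L_1 penalty (soft-thresholding) with that of a squared L_2 penalty. This is a generic textbook comparison, not the authors' optimizer or the algorithm of [4]; the function names and the example values are assumptions.

```python
import numpy as np

def prox_l1(w, lam):
    """Proximal operator of lam * ||w||_1: soft-thresholding, sets small entries to exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_l2_squared(w, lam):
    """Proximal operator of (lam / 2) * ||w||_2^2: uniform shrinkage, no exact zeros."""
    return w / (1.0 + lam)

w = np.array([0.05, -0.3, 1.2])   # hypothetical weights
sparse = prox_l1(w, 0.1)          # first entry becomes exactly zero
shrunk = prox_l2_squared(w, 0.1)  # all entries stay non-zero
```

Entries driven exactly to zero need not be stored, whereas the L_2 step only scales the weights down.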