Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Quality: The paper is technically sound. The claims and proofs are well written and supported by experiments on both public and private datasets. This is a complete piece of work. Clarity: The paper is mostly clearly written and well organized; however, the discussion of the task goal and the presentation of the results could be improved. Originality: The paper introduces a new method for an existing task. Related work is clearly cited, and the method is novel and distinct from past approaches. Significance: The results are significant and could become the new baseline in this area.
The paper presents a method for scaling up classifiers to tasks with an extremely large number of classes, with memory requirements scaling as O(log K) for K classes. The proposed model effectively uses a count-min sketch to transform a very large classification problem into a small number of classification tasks, each with a fixed small number of classes. Each of these models can be trained independently and in parallel. Experimental results on a number of multi-class and multi-label classification tasks show that, depending on the dataset, it either matches or outperforms more resource-demanding approaches, with a controlled reduction in model size governed by the choice of the B and R parameters. When used to model retrieval as classification with a very large number of labels, it outperforms Parabel, which is also limited by memory requirements.
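The scheme described above can be sketched in a few lines: each of K labels is hashed by R independent hash functions into B buckets, a small B-class model is trained per hash function, and a class score is decoded by averaging the probability each small model assigns to that class's bucket. This is only an illustrative sketch under assumptions of my own (a 2-universal hash family and placeholder probability vectors standing in for trained classifiers); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, B, R = 1000, 32, 4  # K classes reduced to R small classifiers over B buckets each

# Assumed 2-universal hash family h_r(k) = ((a_r * k + b_r) mod P) mod B
P = 2**31 - 1  # Mersenne prime
a = rng.integers(1, P, size=R)
b = rng.integers(0, P, size=R)

def bucket(r, k):
    """Bucket assigned to class k by the r-th hash function."""
    return int((a[r] * k + b[r]) % P % B)

# Placeholder: each of the R small classifiers would return a probability
# vector over its B buckets for an input x; random vectors stand in here.
probs = rng.dirichlet(np.ones(B), size=R)  # shape (R, B)

# Decoding: score class k by averaging the probability each classifier
# assigns to k's bucket, then predict the argmax.
scores = np.array([np.mean([probs[r, bucket(r, k)] for r in range(R)])
                   for k in range(K)])
pred = int(np.argmax(scores))
```

Memory scales with R * B rather than K, which is how the O(log K) footprint arises when B and R are chosen appropriately.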
This paper studies extreme classification with a large number of target categories. It develops a hashing-based algorithm, MACH. MACH leverages multiple hash functions, each of which maps the label classes into a small set of groups. A classifier is then trained and applied for each hash mapping, on the reduced problem with a much smaller number of target classes. The predictions of the sub-classifiers are then combined to reconstruct the final output. The proposed method is demonstrated to be both efficient and effective on multiple datasets. Pros: a simple and efficient extreme classification method. The divide-and-conquer approach works well on several datasets. Cons: The effectiveness of the proposed method seems to be rather data dependent: it works well on ODP but not as well on ImageNet, and there is no clear discussion of this divergence. The experimental setup and the motivation of the 50M IR experiment are a little surprising. The standard IR approach is not to treat each target document as a target class, and it is unconventional to treat the product search task as a multi-class classification task. The application of extreme classification in this setting, which is the main motivation of the paper (as stated in the title and introduction), seems arbitrary. The baselines compared in the product search setting are also rather weak; only simple embedding-based ones are compared.