Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper studies primarily studies the problem of robust mean-estimation, when a certain fraction of the data is corrupted adversarially. Then, in this setting, the authors propose a nearly-linear time estimator(which is also practical), which achieves the information-theoretic optimal rate. The authors back their claims by conducting experiments on CIFAR, word embeddings etc. These are some of the first experiments to be done at this scale in the robust statistics community. The paper is well written and the authors provide a lot of intuition for their algorithms. The authors are very rigorous in their theoretical claims for robust mean estimation. A minor criticism is the handling of the "outlier detection component". I agree with the authors that prima-facie it is not clear, when the outlier detection problem is meaningful. But, then, what is the the output of the algorithm 2 giving? From my understanding, I can see that if the outlier distribution is such that operator norm of the covariance of the mixture increases, then the weighting given by algorithm 2, can be used to identify the outlier points -- is this correct? But is this an if and only if? For example, say the outlier distribution N, is a point-mass at the true mean of D, then what would algorithm 2 give? Minor Question: --In step 3 of algorithm 2, is there a centering step missing?
This work considers the problem of identifying adversarial outliers, with applications to robust mean estimation. The work proposes to use a "quantum entropy" regularization which can be solved quickly using a matrix multiplicative weights algorithm. This leads to a near-linear time algorithm. The paper has a nice high-level discussion of how the entropy term improves upon the previous, spectral filtering methods and obtains the improved running time: intuitively, whereas the spectral filtering corrects a single direction at a time, the entropy term encourages filtering many directions simultaneously. Obtaining a (near-)linear time algorithm is pretty significant, as the time complexity significantly impacts the breadth of settings in which such methods may be applied (and of course, we can't hope to beat linear time). Large data sets demand linear time algorithms. The paper also includes experiments demonstrating that the method is effective. The experiments do convincingly show that the method is effective at removing outliers compared to several simple methods. The experiments are pretty well constructed, considering some natural models of contamination in two real data sets (CIFAR and word embeddings) and a synthetic data set. Compared to the simple (fast) baseline methods considered here, the new method does show significant improvement at identifying outliers. I have a few small criticisms here, though. First, even though the polynomial-time algorithms proposed in prior works are not near-linear time, is it infeasible to run them on these data sets? Since the running time guarantee was the main theoretical focus of the work, I would have liked to see some evaluation of the running times in addition to the accuracy; conversely, it would have been interesting to see the AUC scores achieved by these slower polynomial-time methods. If they are the same, then the new method should be a clear "win," but even if they achieved a better AUC in practice, it is desirable to understand what the trade-off is. I realize that the baseline methods have been chosen because they are fast, but they seem a bit weak, and it would be good to include a strong-but-slow baseline to understand if the new method gives anything up in practice. Finally, the plots for the experiments are so small that they are unreadable unless one zooms in or uses a magnifying glass. There is no point to including them in a print version. As nice as the extensive plots are, it would be better to choose a (representative) selection and just note that the others are similar.
Originality: The work proposes a new technique based on outlier scoring for robust estimation and it intuitively builds on the ideas of previous works. This can also be used for outlier detection. It also cites the related works adequately. Because of being the first (nearly) linear time algorithm, I think it scores well on originality. Quality: Experiments: I think there should be more experiments for robust mean estimation (see the improvements section for comments on this). Proofs: The authors have provided proofs for their claims. I found a few typos (see the improvement section for comments on this). I also have one question for the authors - for inequality (b) in (21) how did you bound \|\rho\|_2 \geq 1? Clarity: The paper is well written and well organized. However, because there are so many symbols and definitions, I would recommend a table pointing to definitions of terms like Goodness of Set,M, U. Also, it would be nice to explicitly define S_r and S_b ( I couldn't find them explicitly defined somewhere.) Significance: The significance is high because this paper gives the first linear time algorithm for robust mean estimation and the technique used seems intuitive. Because this is linear in time and thus practical, it may benefit the robust machine learning research in a variety of ways.