Paper ID: 4163

Title: Globally Convergent Newton Methods for Ill-conditioned Generalized Self-concordant Losses

This paper proposes an approximate Newton algorithm (using e.g. sketching or subsampling) for minimizing generalized self-concordant convex functions in supervised learning problems. The main novelty of the algorithm itself appears to be that, instead of finding the region of fast convergence with a first-order method, they use their own algorithm with a decreasing sequence of regularization parameters. The primary benefit of their approach is that it extends earlier convergence rates that are independent of the condition number (in the self-concordant setting) to the generalized self-concordant setting (although it apparently isn't quite so simple---see Remark 1 of the paper).

Writing Quality
---------------
First, the paper is overall well-written, and fairly comprehensive (i.e. I think that the authors considered everything that needed to be considered, although one might argue about how much attention was paid to various portions). With that said, there are a few things that confused me, and might be possible to clarify:

1) In Definition 1, what does the bracket notation after \nabla^3 and \nabla^2 mean? Also, a discussion of how, exactly, this definition differs from that of [6], and why these differences matter, would be appreciated. Going further, it might be nice to discuss how it differs from regular self-concordance.

2) The "contributions" paragraph of the introduction is valuable, but I think it would be more useful (at least to a reviewer, and we're presumably the intended audience of this paragraph) if it not only said *what* was new, but more explicitly *how it improved* on the prior state of the art. This is obviously intended to be fleshed out in Section 2, but even there, the differences between the proposal and the references are not explicit. For example, I'm not sure how this paper differs from prior generalized-self-concordant work (e.g. [6,7]). Is the main difference in the method by which you find the region of fast convergence? That they didn't use approximate Newton (or used a "different" approximate Newton)? That they did not pay as much attention to condition numbers? Something else?

3) Following up on this, could some of the information in Section 2 be summarized in a table? With corresponding cuts in the section itself, this could wind up being more space-efficient, in addition to being easier to use as a reference when reading later in the paper.

4) A lot of notation is introduced, and it's hard for a reader to keep track of all of it (Q, for example, stymied me for a while, in the discussion after Theorem 1). A table of notation in the appendix might mitigate this.

Finally, as a minor quibble, late in Section 3 the expression "way smaller" is used repeatedly. I'm not complaining about the informality, but the repetition is jarring.

Contribution
------------
I'm not familiar enough with the background literature to judge the magnitude of the contribution of the paper. I assume that the idea of using an approximate Newton method based on relative approximations as in Definition 2 is *not* new---is this correct? If so, then it appears to me as if the main contribution is in using their own algorithm (instead of a first-order algorithm), with a decreasing schedule of regularization parameters, to find the region of fast convergence. Assuming that this really is a new approach (is it?), and that it is what's mostly responsible for reducing/removing their dependence on the condition number (is this true?), then it seems to me to be a significant contribution. More to the point, I think it would be of sufficient magnitude to warrant publication in NeurIPS. But, as I said, I don't have enough familiarity with the background literature to really know.
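To make sure I understand the approach, the kind of scheme I have in mind is sketched below: approximate Newton steps with a subsampled Hessian, where the regularization is decreased geometrically so each solution warm-starts the next. This is my own toy illustration for intuition, with invented names and constants; it is not the paper's actual algorithm.

```python
import numpy as np

def logistic_grad_hess(w, X, y, lam):
    """Gradient and Hessian of the l2-regularized logistic loss
    mean(log(1 + exp(-y * Xw))) + (lam/2) * ||w||^2, with y in {-1, +1}."""
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-y * z))           # sigmoid of the margin
    g = X.T @ ((p - 1.0) * y) / len(y) + lam * w
    s = p * (1.0 - p)                           # per-sample Hessian weights
    H = (X * s[:, None]).T @ X / len(y) + lam * np.eye(X.shape[1])
    return g, H

def subsampled_newton(X, y, lam_final, lam0=1.0, q=0.5, m=200,
                      inner_steps=5, rng=None):
    """Decrease the regularization lam0 -> lam_final geometrically,
    taking a few approximate Newton steps (full gradient, Hessian
    subsampled on m rows) at each level; each solution warm-starts
    the next, smaller regularization."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    lam = lam0
    while True:
        lam = max(lam_final, lam * q)
        for _ in range(inner_steps):
            idx = rng.choice(n, size=min(m, n), replace=False)
            g, _ = logistic_grad_hess(w, X, y, lam)            # exact gradient
            _, H = logistic_grad_hess(w, X[idx], y[idx], lam)  # subsampled Hessian
            w -= np.linalg.solve(H, g)
        if lam == lam_final:
            return w
```

Because the gradient is exact and only the Hessian is approximated, the iterates still converge to the true regularized optimum; the subsampling only affects the contraction rate.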
With that said, there are two obvious weaknesses: (1) this work only applies to certain particular convex problems (Example 1 in the paper discusses the main instances), and (2) many machine learning practitioners have developed a well-earned skepticism of the applicability of higher-order methods (or really, of anything with a high per-iteration cost) to ML problems. These, particularly point (2), might seem unfair, and to some extent they are (past failures may indicate future failures---hence "well-earned"---but they don't require them), but thankfully they could still be addressed with sufficiently comprehensive experiments.

The experiments are decent. They use large real-world datasets on realistic problems, and achieve impressive-seeming results. What they were not---and for the reasons mentioned above I think this is important---was comprehensive. For one, both were in the kernel setting, and for another, both used logistic regression. There was also only one baseline (KSVRG). Where were the other settings mentioned in Example 1, and where were the numerous potential baselines mentioned in Section 2? More importantly, these are machine learning applications, so "distance to optimum" doesn't strike me as being the correct metric. What one is really interested in is "accuracy on a test set", which I suspect plateaus at a far less precise solution than is plotted in Figure 1 (i.e. I suspect that KSVRG, and other alternatives, would appear far more competitive measured in terms of test accuracy).

==== AFTER AUTHOR RESPONSE ====
Thank you for responding so thoroughly to my review. The changes that you suggested address nearly all of my concerns, and I'm happy to see that your approach does indeed outperform KSVRG in terms of classification error. I'm raising my score to 6.
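As an aside on question (1) under "Writing Quality" above: the convention I'm used to (which may differ in constants and exact form from the paper's Definition 1) reads $\nabla^3 f(x)[h]$ as the directional derivative of the Hessian along $h$, and the standard versus generalized conditions look roughly as follows.

```latex
% Bracket notation: directional derivative of the Hessian along h
\nabla^3 f(x)[h] \;=\; \lim_{t \to 0} \frac{\nabla^2 f(x + t h) - \nabla^2 f(x)}{t}

% Standard self-concordance (Nesterov--Nemirovski style):
\left| \nabla^3 f(x)[h, h, h] \right|
  \;\le\; 2 \left( \nabla^2 f(x)[h, h] \right)^{3/2}

% One common generalized variant (e.g. for logistic-type losses):
\left| \nabla^3 f(x)[h_1, h_2, h_2] \right|
  \;\le\; M \, \|h_1\|_2 \, \nabla^2 f(x)[h_2, h_2]
```

The key difference is that the generalized bound controls the third derivative by the Hessian and a Euclidean norm, rather than by a power of the Hessian quadratic form alone.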

Originality
------------
The paper presents a very novel Newton method algorithm adapted to large-scale ill-conditioned problems. The authors clearly state how their algorithm differs from and relates to the literature.

Quality
---------
The authors present a very complete study of an optimization algorithm. Due to the (over)length of the appendix, it is difficult to carefully check all the proofs. Maybe a journal would be a more appropriate submission track. Nevertheless, all claims in the paper are justified and the proofs look technically sound. For completeness and transparency, the experiment against FALKON should be included in the main paper, as it is a directly related work.

Clarity
--------
The paper is very dense in information, but well written and organized. All the elements required for the understanding of the paper are provided. There is a formatting issue at line 211: the box is too big. No conclusion is provided.

Significance
---------------
Even though a detailed pseudo-code is given in the appendix, the impact and significance of the paper could be improved a lot by providing a functional piece of code, as this algorithm could interest and impact a large community.

I have read the author response and my opinion remains the same.

I thank the authors for their response, and would like to maintain my overall evaluation.

=====

The paper is generally well-written and addresses a fundamental problem. The proposed algorithm seems sound, although I will let other reviewers comment on its novelty, as I may not be familiar with the most recent developments in second-order methods. The appendix is quite comprehensive, but I have not gone through it to verify the proof details. The experiments (Section F in the appendix provides more details) appear to confirm the theoretical claims, although it may be helpful to discuss or compare with the Newton sketch of [24] (same as [25]).

A couple of questions regarding related work:

- I'm still unclear how the proposed method compares to the Newton sketch of [24]/[25], both theoretically and empirically, in terms of accuracy and computational complexity.
- As the authors noted in Section 1.1, methods based on sketching and sub-sampling are essentially independent of the condition number. In this case, what are the advantages of the proposed approach?