NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5049 DINGO: Distributed Newton-Type Method for Gradient-Norm Optimization

### Reviewer 2

Originality. As far as I understood, the paper proposes an extension of [15] for the distributed setup. This requires careful analysis, but not so many new ideas. I think the comparison with https://ieeexplore.ieee.org/abstract/document/8675539 should be added. Quality. The numerical experiments seem to be convincing. I didn't check the proofs thoroughly, but the high-level intuition is not clear for me. As far as I see, the method uses the inverse of individual Hessians H_i,t, but not the inverse of the full Hessian H_t. I don't understand, why this works in the light of the fact that H_t^{-1} != \sum H_i,t^{-1}. Also, it is not clear for me why the considered three cases are mutually exclusive. Clarity. Here I have the same questions as in the quality part. It would be nice to provide more explanations. At the same time, I very much appreciate that the authors give some intuition and illustration for their quite complicated assumptions and their connection to the standard assumptions. Significance. I think the method can be useful in practice. Especially since it has smaller number of hyperparameters, wider range of applications including non-convex functions. ===========After rebuttal============ I can't say that the authors addressed all my questions from the review. Especially the ones in the quality part. But, I found some answers in the GIANT paper. So, I leave my score unchanged.

### Reviewer 3

As is stated in question 1, I like the idea of not using the strong convexity assumption. My main concern is associated with the computational costs of computing $H_{t,i}^\dagger g$ and\or $\widetilde H_{t,i} g$. It seems to me that (section 4) iterative methods are used to compute these matrix-vector multiplications. However, the convergence analysis seems to require exact computation. The authors are also suggested to elaborate more on the communication cost associated with the line search step. Other comments: 1. The algorithm DINGO involves a few hyper-parameters. It would be good if the authors can discuss how these hyper-parameters are tuned so that the algorithm can achieve better performance. 2. I am not sure whether the step length \alpha_t can eventually be chosen as 1. 3. For the convergence of the algorithm, many assumptions are needed (e.g., Assumptions 1,2,3,5). I am not sure whether the example considered in Section 4 satisfies the assumptions or not. 4. In the implementation, the authors use iterative solvers without preconditioners. If the subproblems have bad conditioner numbers, I do not know if these iterative methods can obtain solutions with sufficient accuracy to guarantee the progress of the algorithm.

### Reviewer 4

Strength/weakness/questions/suggestions: 1- The paper is well-written and it is also structured properly. 2- In Algorithm 1, it is needed to calculate the product of pseudo-inverse of $\hat{H}$ and some vectors $g_t$ and $\tilde{g}_t$. It can be costly. It would be more clear if the authors clarify more about it. 3- In equation (6), it is expensive to calculate $P_{t,i}$. There is an inverse calculation (which the required info can be calculated by a system of equations too, however expensive), and in each iteration of the algorithm, there is some expensive parts mentioned in the above and current points plus “Line Search”. 4- In Algorithm 1, there is no explanation how “Line Search” will be done. Is it done in distributed environment? Or just in master machine? Also, the “Line Search” mentioned in this paper is very expensive (full gradient calculation is needed in each step), and if this “Line Search” is done in master node, then it may happen some cases that the master node would be very busy, or equivalently, the algorithm would not scale well (according to Amdahl’s law). 5- Assumption 3 seems to be a strong assumption. Also, the assumption in Lemma 1, “the Hessian matrix $H_{t,i}$ is invertible”, is strong too. Is the latter assumption is based on strong convexity? The reviewer did not notice why this assumption should be true. 6- In this paper, in several parts it is mentioned “a novel communication efficient distributed second-order optimization method”, however, there is no analysis regarding the required number of communication rounds to reach the solution that shows its efficiency (similar to DiSCO algorithm). 7- The reviewer could not find any information about how DINGO is distributed in practice? Do the authors use GPU or CPU for their calculations? The reviewer was eager to see the code for the proposed algorithm, however, no code was available to check how DINGO is distributed. 8- In Figure 1, it is suggested to compare the results based on wall-clock time not just the communication rounds. In some cases, the expensive calculations may be done in master node, therefore, less communication rounds would be needed, however, the wall-clock time would be very high. Another reason is that, the number of communication rounds is somehow equivalent to the number of iterations (not a good measure though), and it is suggested to compare the results of distributed algorithms based on true measure, i.e., wall-clock time. 9- In Figure 1, rows 1,2 and 4, did the authors consider the cost of line search when x-axis is “Communication Rounds” for DINGO? 10- In Figure 1, what do the authors mean by “Worker”? The details should be provided. 11- It is clear from row 3 of Figure 1, that the “Line Search” mentioned in Algorithm 1 is expensive (many gradient evaluation is needed), and not efficient if the authors want to have well-scaled algorithm. ============= After Rebuttal ============= I read the rebuttal carefully, and the authors answered my main concerns. I am generally satisfied with the rebuttal, and thus increased my score.