NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, Vancouver Convention Center
Paper ID 537: An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums

### Reviewer 1

MINOR COMMENTS:
1. Page 3, line 85: "low communication without sparsity assumptions". Please add references for such results.
2. Page 3, line 99: "much faster than computing the gradient of the conjugate [...]". I agree with this, but it also shows that the comparison is not entirely fair, since e.g. MSDA was not designed for finite sums on each node.
3. Page 5, line 165: "generalized version of the APCG algorithm". I believe it is worth being more precise at this point about which generalizations have been included in this algorithm, and why they matter.

### Reviewer 2

This paper proposes a new, efficient accelerated decentralized stochastic algorithm (called ADFS) for large-scale finite-sum optimization. The proposed algorithm uses local stochastic proximal updates and randomized pairwise communication between nodes. Experimental results show the effectiveness of the proposed algorithm. Although the paper is theoretically sound, the current version can be improved in the following respects:
1. The probability $p_{kl}$ is a very important parameter, so the authors should elaborate on how it is chosen.
2. Does the topology of the network have a strong influence on the algorithm? If possible, more experiments should analyze this issue.
3. The two data sets used in the paper are relatively small for distributed stochastic optimization. In particular, it would be interesting to know whether the algorithm is effective on high-dimensional data sets, e.g., RCV1 and KDD2010.
4. Point-SAGA [8] is compared against the proposed distributed algorithm, but it is a serial algorithm for stochastic optimization.
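Regarding point 1 above: the edge probabilities $p_{kl}$ act as an importance-sampling distribution over which communication edge is activated at each step. A minimal sketch of that sampling pattern follows; the edge list and weights here are illustrative assumptions, not values from the paper, and `sample_edge` is a hypothetical helper, not ADFS itself.

```python
import random

# Hypothetical sketch: at each step, one edge (k, l) of the communication
# graph is activated with probability p_kl (importance sampling over edges).
# The edges and weights below are made up for illustration only.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
p = [0.4, 0.1, 0.1, 0.4]  # illustrative p_kl values, summing to 1

def sample_edge(edges, p, rng=random):
    """Sample one edge (k, l) with probability p_kl."""
    return rng.choices(edges, weights=p, k=1)[0]

counts = {e: 0 for e in edges}
for _ in range(10000):
    counts[sample_edge(edges, p)] += 1
# Edges with larger p_kl are activated more often, which is why the
# choice of p_kl directly shapes where computation and communication
# effort is spent, and hence the convergence rate.
```

This makes the reviewer's concern concrete: skewing $p_{kl}$ toward some edges starves the others, so the sampling distribution is a genuine tuning knob rather than an implementation detail.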

### Reviewer 3

Motivated by the APCG method for empirical risk minimization, this paper proposes an accelerated decentralized stochastic algorithm for finite sums. An augmented communication graph is introduced so that the original constrained optimization problem can be transformed into its dual formulation, which is then solved by stochastic coordinate descent. The proposed method employs randomized pairwise communication and stochastic computation. An adaptive sampling scheme for selecting edges is introduced, analogous to importance sampling in the literature. The theoretical analysis shows that the proposed algorithm achieves an optimal linear convergence rate for finite sums, and its time complexity improves on existing results. The experiments on logistic regression show that it outperforms related methods and is robust across different network sizes. Overall, I think the theoretical results are novel.

It would be better to provide some intuition for the algorithmic design. In particular, the choices of step size and R_{k,l} are not clearly explained. How do we obtain W_{k,l}\Sigma^{-1}y_t in Step 6? Step 9 in Algorithm 1 looks problematic to me: shouldn't it be v_{t+1}^{(i)} = z_{t+1}^{(i)} - z_{t+1}^{(i,j)} + v_{t+1}^{(i,j)}? What is the purpose of \mu_{i,j}? At first glance, I thought \mu_{i,j} appeared in the communication matrix A, but I got confused when I saw (W_k\Sigma^{-1}y_t)^{(k)} \propto \mu_{k,l}^2 in Line 188.

The homogeneous setting may not be very interesting. One particular application of decentralized computing is mobile computing, in which different users (nodes) have different local sample sizes, and networking conditions also vary across regimes. I therefore think the paper could be improved by considering varying local sample sizes and networking conditions. Since the proposed ADFS method requires a number of hyper-parameters, how should they be chosen in practice? I do not find a detailed description in either the main text or the supplementary material.

Minor comments:
1. If I understand correctly, the inequality in Line 91 does not hold when batch computations are faster than communications. In this case, "communications are faster than evaluating a single proximal operator ..." would be wrong.
2. There is no definition of Vec(X_{i,j}).
3. The definition of \kappa_{min} in Section D.4 is missing.

*******************************************

I have read the rebuttal, and my opinion remains the same. I still encourage the authors to perform experiments on larger datasets (more samples and higher dimensionality).
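The "randomized pairwise communication" pattern that Reviewer 3 refers to can be illustrated with standard randomized gossip averaging. This is a generic sketch of that communication pattern, not the paper's ADFS dual update; the graph and values are assumptions made up for the example.

```python
import random

# Minimal randomized-gossip sketch (NOT the ADFS update): at each round,
# one edge is drawn at random and its two endpoint nodes average their
# local values. Repeated pairwise averaging drives all nodes toward the
# network-wide mean while preserving the total sum.
def gossip_round(x, edges, rng):
    k, l = rng.choice(edges)      # pick one communication edge at random
    avg = 0.5 * (x[k] + x[l])     # the two endpoints exchange and average
    x[k] = x[l] = avg

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]          # heterogeneous local values (assumed)
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
for _ in range(200):
    gossip_round(x, edges, rng)
# After many rounds all nodes sit near the average 2.5, while each
# pairwise averaging step leaves the sum of the node values unchanged.
```

The same pairwise structure is what lets a decentralized method avoid global synchronization, which is also where heterogeneous node speeds and link qualities (the reviewer's mobile-computing point) would enter: slower or noisier edges would simply be activated with different probabilities.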