Paper ID: 550
Title: Quartz: Randomized Dual Coordinate Ascent with Arbitrary Sampling
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance.
The paper presents a variant of dual coordinate ascent for regularized empirical risk minimization with smooth convex loss functions and strongly convex regularizers. The authors derive an update scheme in which each iteration updates only a subset of the dual coefficients. They prove convergence of the method for any sampling scheme that produces these subsets, and moreover derive sampling schemes that speed up convergence. In experiments, the authors verify their theoretical findings.
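As a concrete illustration of this kind of update scheme, here is a minimal dual coordinate ascent sketch for ridge regression in which each iteration updates a randomly sampled subset of dual coefficients. This is an editorial toy with a plain serial stepsize, not the paper's Quartz method (whose stepsizes are tied to the sampling distribution); all names are our own.

```python
import numpy as np

def dca_subset(A, y, lam, n_iters=3000, batch=2, seed=0):
    """Dual coordinate ascent for ridge regression, updating a randomly
    sampled subset of dual coefficients per iteration.

    Minimizes (1/n) * sum_i 0.5 * (A[i] @ w - y[i])**2 + (lam/2) * ||w||^2.
    Simplified serial-stepsize sketch, NOT the paper's Quartz method.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    alpha = np.zeros(n)   # dual coefficients
    w = np.zeros(d)       # primal iterate, kept equal to A.T @ alpha / (lam * n)
    sq_norms = (A ** 2).sum(axis=1)
    for _ in range(n_iters):
        S = rng.choice(n, size=batch, replace=False)  # the sampled subset
        for i in S:
            # closed-form coordinate maximizer for the squared loss
            delta = (y[i] - A[i] @ w - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
            alpha[i] += delta
            w += delta * A[i] / (lam * n)  # keep primal iterate in sync
    return w, alpha
```

At the fixed point the primal iterate solves the ridge problem, which gives a cheap correctness check against the closed-form solution (A.T A / n + lam I)^{-1} A.T y / n.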

The paper is well written and scientifically sound. Both the theoretical and the experimental results suggest that the proposed method outperforms existing algorithms in highly parallel settings, especially on very sparse data. My only concern is about the experiments, which are not very conclusive:

- The combination of dataset, setting and method is not consistent.

- Why not perform serial and mini-batch experiments for all 6 datasets?

- Why not report SPDC (rho=1) together with the other methods in the serial setting?

- Likewise, why not report on ASDCA in the mini-batch setting?

- No reporting of wall-clock time, and no reporting of predictive performance and/or values of the objective; it is hard to tell whether primal-dual gaps of 1e-4 and 1e-10 make any difference in practice.

- No setting other than block size m (=1), loss (smoothed hinge), and regularizer (l2); consider m>1 where sparsity may significantly drop, ridge regression where the Lipschitz constant is unbounded, etc.

- Only one large real-world dataset; consider rcv1 and/or imagenet.

- Please give details on the choice of lambda. It has a significant impact on the convergence speed, but also on the predictive performance of the resulting model. For very large datasets, the "optimal" lambda effectively approaches zero; what (practical) impact would this have on the convergence?

Minor comments for the authors:

- Line 224: consider "the number of matrices A_i with any nonzero entry in row j" (it wasn't quite clear to me what "blocks" refers to).

- Eq 17: colon.

- Line 295: should be omega_j; consider adding max_j omega_j \leq n.

- "Average sparsity", in line 295, is misleading; omega_j tilde is the "average number of non-zero entries for dimension j"; omega is an (absolute) measure of density rather than a (relative) measure of sparsity. Similarly in Table 3, where 100% sparsity denotes a fully dense dataset.

- The titles in Figure 1 and, in particular, in Figure 2 are not very helpful; consider adding the 3 artificial datasets to Table 3; currently the sparsity (their key property) is only stated in the caption of Fig 2.

- Figure 1: consider reordering the methods; both Quartz variants should be red.

- Supplementary: the last line before Ch. 8 should be m > 1.
Q2: Please summarize your review in 1-2 sentences
The authors propose a variant of dual coordinate ascent which is very efficient for very sparse data and can leverage distributed/parallel computing. The proposed method and the theoretical findings contribute to the field; however, the experiments are not very detailed.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance.
Quality ------- The paper is of high quality and the analysis appears technically sound and complete. The empirical evaluation is rather brief, but shows that the theoretical analysis matches the empirical performance.

Clarity ------- The paper is well written and structured, and is informative even without consulting the proofs in the appendix.

Originality ----------- While the proposed algorithm is very close to (variants of) SDCA, the extension to arbitrary sampling distributions and the corresponding analysis appear novel.

Significance ------------ Stochastic convex optimization is clearly a relevant topic, and the paper makes contributions both on the algorithmic as well as on the analysis side that other researchers can build on. For practitioners, the impact of the paper could be much improved by demonstrating the performance of the algorithm in the distributed setting, with data sets that do not comfortably sit in the memory of a single machine.
Q2: Please summarize your review in 1-2 sentences
The paper proposes and analyzes a novel primal-dual algorithm for stochastic convex optimization. The paper is of high quality, well written (though dense) and the proposed algorithm and analysis appear novel. The (limited) empirical analysis shows some improvement over state of the art in certain regimes.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thanks to all reviewers for their comments.

Rev 1:

1) All your suggestions for additional tests are completely reasonable - in fact, we have done vastly more testing than what we have been able to include in the paper due to the page limit. We will do our best to add several new experiments to the main paper or the supplementary material.

2) We will add SPDC (\tau=1) to the serial setting plots and ASDCA to the mini-batch setting plots. We did test with large real-world datasets, such as astro_ph and CCAT. The performance in the serial setting is quite similar to what we report in the paper.

3) In the mini-batch setting, in our experience Quartz is better than SPDC whenever the "average number of nonzeros \tilde w", as defined in line 294, is sufficiently small. Notice that \tilde w in general does not depend only on the sparsity of the data. To demonstrate the performance of Quartz, we generated a random dataset with sufficiently small \tilde w.

4) The computational costs per epoch of the considered methods (Quartz, SDCA, SPDC) are all of the same order as nnz(A). From this point of view, comparing the number of epochs is preferable to wall-clock time as it is independent of implementation and compute environment.

5) In the serial setting, lambda is chosen to be close to 1/n, a popular setting believed to lead to good predictive performance. With lambda smaller than 1/n, the method converges more slowly both in theory and in practice, and in this case accelerated methods such as SPDC and ASDCA are normally a better choice. However, as shown by Fig 2, for certain datasets Quartz can be better than the accelerated method SPDC even when lambda=0.1/n. In general it is difficult to find a good balance between the convergence speed of the method (the choice of lambda) and good predictive performance - the usual approximation-estimation trade-off.

6) We will fix all minor issues.

Rev 2:

Note that in Fig 2 we **do** show that Quartz can be much better than SPDC. As shown in our experiments, for sufficiently sparse datasets Quartz enjoys an almost linear speedup (1/\tau) in the mini-batch setting, whereas SPDC only has a sublinear speedup (1/\sqrt{\tau}), where \tau is the size of the mini-batch. This is an enormous difference for large mini-batch sizes!
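To make the gap concrete, a back-of-the-envelope comparison of the two speedup regimes (illustrative arithmetic only, not numbers from the paper):

```python
# Relative epoch counts under a linear (1/tau) vs. a square-root
# (1/sqrt(tau)) mini-batch speedup; tau = mini-batch size.
# Illustrative arithmetic only.
for tau in (1, 16, 256, 4096):
    linear = 1.0 / tau            # linear speedup regime
    sublinear = 1.0 / tau ** 0.5  # sqrt speedup regime
    print(f"tau={tau:5d}  linear={linear:.6f}  sqrt={sublinear:.6f}  "
          f"gap={sublinear / linear:.0f}x")
```

At tau = 4096 the linear regime needs 64x fewer epochs than the square-root regime.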

Rev 3:

We will do our best to try to squeeze in an additional experiment - in a distributed setting.

Rev 5:


Rev 4 and 7:

Both reviewers question the benefit of our method (Quartz) in comparison with SDCA / Iprox-SDCA since, in their words, the methods have the same complexity rate (we mention this in Sec 3.1) and similar empirical behavior (we mention this in Fig 1). In fact, the answer to this question is clearly articulated in many parts of the text, including the abstract, the introduction, Sec 3.2, Sec 3.3 and Fig 2. Please read these!

1) The theoretical and practical performance of Quartz specialized to the serial sampling is indeed very close to SDCA/Iprox-SDCA. We considered it important to mention this for the benefit of the reader, as this is an important special case. It is an observation meant to reassure the reader already familiar with these state-of-the-art algorithms - it is not the main selling point of the paper!

2) Note that while SDCA / Iprox-SDCA only update one dual coordinate at a time, Quartz allows for "arbitrary sampling" (a distribution over all subsets of coordinates) - which gives it massive flexibility and in our opinion will lead to many applications in the future. Quartz is the first primal-dual method capable of working with an arbitrary sampling.
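For intuition, an "arbitrary sampling" is simply a probability distribution over subsets of the coordinates; the quantities a sampling-dependent analysis works with are its marginals p_i = P(i in S). A toy sketch of this notion (our own example, not code or notation from the paper):

```python
def marginals(subsets, probs, n):
    """Exact marginal probabilities p_i = P(i in S) of an arbitrary
    sampling, given as an explicit distribution over subsets of
    {0, ..., n-1}. Sampling-dependent stepsizes are built from such
    marginals. (Toy illustration, not the paper's implementation.)"""
    p = [0.0] * n
    for S, q in zip(subsets, probs):
        for i in S:
            p[i] += q
    return p

# A sampling mixing two mini-batches and a singleton over 3 coordinates:
p = marginals([{0, 1}, {1, 2}, {2}], [0.5, 0.3, 0.2], n=3)
```

Serial sampling (singletons with probability 1/n each) and tau-nice mini-batch sampling (uniform over subsets of size tau) are both special cases of such a distribution.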

3) When more than one processor is available, updating in parallel more than one dual coordinate can lead to significant speedup, provided that a suitable stepsize is carefully chosen to guarantee the convergence. Such mini-batch setting was not considered previously for SDCA nor Iprox-SDCA! This arises as a special case of our arbitrary sampling setup.

4) Quartz is capable of taking advantage of the sparsity of the data, so that an almost linear speedup (processing time divided by the number of processors used) is achieved when the data is sufficiently sparse. Such a data-driven speedup property, demonstrated in Figure 2, does not exist in other SDCA-like methods; note that our method is vastly better than SPDC in this regime.

5) The data driven speedup can be so large that our method can be better, both in theory and in practice (see Sec 3.3), than methods with momentum.

6) Our method is not merely a variant of SDCA, it is an entirely new method. Its analysis is simpler than that of SDCA, and unlike SDCA, it is directly primal-dual. As a result, in the special case of a serial sampling, we obtain the right logarithmic terms (this is not very important, but is nice to have nevertheless).