NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Reviewer 1
This paper proposes a new method for empirical composition optimization problems, to which vanilla SGD is not applicable because of the finite-sum structure inside non-linear loss functions. The method is a primal-dual method with variance reduction, applied to a saddle-point reformulation of the original problem (sketched below). In the theoretical analysis, a linear convergence rate of the method is established under strong convexity. In the experiments, the superior performance of the method over competitors is verified empirically on portfolio management optimization problems.

Clarity: The paper is clear and well written.

Quality: The work is of good quality and is technically sound.

Originality and significance: The problem (1) treated in this paper is important and covers several applications, as mentioned in the paper. However, there seems to be no existing method that converges at a linear rate for this problem. For the sub-class of problems (2), SCGD with SVRG [8] exhibits linear convergence, but, as explained in the paper, some important machine learning tasks are not covered by (2). In addition, the experiments confirm that the proposed method can significantly outperform existing methods, including SCGD with SVRG. Hence, this paper makes meaningful contributions for both theorists and practitioners. If the authors can show a theoretical advantage of the proposed method over SCGD with SVRG [8] on problem (2), beyond the empirical performance, it would make the paper stronger.

Minor comments: A regularization term $g$ should be added to equation (10).

-----

I have read the authors' response and I keep my score.
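To make the setting concrete, the problem and reformulation this review refers to can be sketched as follows; the symbols $f_i$, $h_j$, and the regularizer $g$ below are generic placeholders rather than the paper's exact equations (1) and (10), and the conjugate-based rewriting additionally assumes each $f_i$ is convex:

$$ \min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} f_i\!\left(\frac{1}{m}\sum_{j=1}^{m} h_j(\theta)\right) + g(\theta), $$

which, introducing the convex conjugates $f_i^*$, admits the equivalent saddle-point form

$$ \min_{\theta}\ \max_{y_1,\dots,y_n}\ \frac{1}{n}\sum_{i=1}^{n}\left[\left\langle y_i,\ \frac{1}{m}\sum_{j=1}^{m} h_j(\theta)\right\rangle - f_i^*(y_i)\right] + g(\theta), $$

on which primal-dual updates with variance reduction can be applied.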
Reviewer 2
The author response addressed my doubts and questions to my satisfaction, and hence I raised my score.

------------------------------------------------------------------------------------

The motivation of this paper is good, and the algorithmic contribution is solid and seems to have wide applications in many popular areas of machine learning. However, I feel the paper is weak in several aspects.

Firstly, the discussion of the theoretical results is highly lacking: how does the theoretical convergence result compare to the related state-of-the-art algorithms, such as the ones in the cited papers [1, 5, 8, 9]? How realistic is the bounded gradient assumption B_f? (Standard SVRG-type methods do not require bounded gradients; see the sketch below.) In addition, the algorithmic idea of combining stochastic variance-reduced gradients with a primal-dual reformulation is not brand new [1, 5], yet the discussion of the relevance of these works seems to be lacking.

The experiments are limited to portfolio management optimization with very small datasets (d = 25, n < 10000), which is a bit frustrating in my view. I would expect much larger-scale experiments, ideally with more applications such as policy evaluation [5] in reinforcement learning, to which the proposed algorithm can also be applied. I would also like to see an experimental comparison with the very recent C-SAGA algorithm (Junyu Zhang and Lin Xiao. A composite randomized incremental gradient method. ICML 2019).

In short, I think the current version of this paper can be massively improved, and I encourage the authors to continue working on it.
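On the bounded-gradient point, the contrast the reviewer draws can be made explicit with the standard SVRG estimator; this is a generic sketch for a plain finite sum $F(x) = \frac{1}{n}\sum_i f_i(x)$ with snapshot point $\tilde{x}$, not the estimator used in the paper:

$$ v_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla F(\tilde{x}), \qquad \mathbb{E}_{i_k}\!\left[v_k\right] = \nabla F(x_k), $$

and when each $f_i$ is $L$-smooth the deviation satisfies $\mathbb{E}\|v_k - \nabla F(x_k)\|^2 \le L^2 \|x_k - \tilde{x}\|^2$, so no uniform bound $\|\nabla f_i\| \le B_f$ is needed. Whether the nested finite sum in (1) genuinely forces such a bound is exactly the question raised above.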
Reviewer 3
The authors have satisfactorily answered my concerns and I am happy to raise my score. The complexity comparison table is interesting and should be included in the final version.

=========

This paper studies a variance-reduced primal-dual gradient algorithm for solving strongly convex composition optimization problems (with a finite-sum structure). The algorithm is shown to enjoy a linear rate of convergence, both theoretically and empirically, and the variance reduction structure allows for an efficient implementation. The paper also provides a few convincing experiments on real datasets, comparing against state-of-the-art algorithms. Overall, the paper tackles a challenging problem with an efficient algorithm, and the proof appears to be correct. Here are a few comments from the reviewer:

1. The complexity bounds in Theorems 1 and 2

The theoretical complexity of the algorithm grows as O(n_X (n_Y + kappa) log(1/eps)); since both n_X and n_Y are large, this seems undesirable. In particular, since the epoch size M has to be chosen on the same order as O(n_X), it seems that after one epoch the algorithm would only improve its optimality gap by a factor of (1 - O(1/n_Y)). May I know if this is the case? In light of the above, it seems that the algorithm in [8] (which is also compared against in the paper) has a lower theoretical complexity for large n_X, n_Y, i.e., only O((n_X + n_Y + kappa^3) log(1/eps)). Even though the proposed algorithm is demonstrated to be faster in the experiments, explaining this gap through an analytical lens is also important (see the worked comparison below).

2. Strong convexity in the risk-averse learning example

Risk-averse learning is given as an example in the paper and in the numerical experiments, yet it seems that the problem itself does not satisfy Assumption 4.1 required by the analysis. From (4) and the discussion that follows, we know that (4) is a special case of (2), where the latter is a special case of (1) with *g(theta) = 0*. However, Assumption 4.1 requires g(theta) to be strongly convex, which is not the case here.

3. Related reference

The linear convergence result proven in this paper seems to be related to: S. Du and W. Hu, "Linear Convergence of the Primal-Dual Gradient Method for Convex-Concave Saddle Point Problems without Strong Convexity", AISTATS 2019, which studies the linear convergence of a variance-reduced primal-dual algorithm under only a non-strongly-convex + strongly-concave structure, with linearly coupled primal and dual variables. This is similar to a special case of the setting in (8), except that (8) has a nonlinear coupling.
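To spell out the arithmetic behind comment 1 (taking the stated bounds at face value; the per-epoch cost and constants are schematic assumptions, not quantities quoted from the paper): if one epoch costs on the order of n_X component evaluations and roughly $O\big((n_Y + \kappa)\log(1/\varepsilon)\big)$ epochs are needed to reach accuracy $\varepsilon$, the total cost is

$$ \underbrace{O(n_X)}_{\text{per epoch}} \times \underbrace{O\big((n_Y + \kappa)\log(1/\varepsilon)\big)}_{\text{epochs}} = O\big(n_X (n_Y + \kappa)\log(1/\varepsilon)\big), $$

whereas the bound cited for [8] is additive, $O\big((n_X + n_Y + \kappa^3)\log(1/\varepsilon)\big)$. For large n_X and n_Y the product n_X n_Y dominates the sum n_X + n_Y, which is the gap the reviewer asks the authors to explain, unless $\kappa^3$ is the dominant term.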