NeurIPS 2020

### Review 1

Strengths: To the best of my knowledge, the BSGD algorithm is the first stochastic-gradient-based algorithm that directly solves the CSO problem itself. The two most relevant works focusing on CSO are [12] and [24]: [12] solves a saddle-point reformulation of CSO, while [24] provides sample complexities for the SAA approach to the general CSO problem. With respect to the SAA approach presented in [24], the BSGD method improves the sample complexities (removing the dependence on d) when F is general convex, matching the lower bounds the authors provide. Although BSGD is not optimal when F is strongly convex and smooth, it matches the complexities of the SAA approach [24]. The authors also discuss the settings in which BSGD may not be optimal, providing a transparent evaluation of their algorithm. In addition to sample complexities for general F, the authors argue that BSGD matches the sample complexities of existing methods for MAML, and BSGD appears to be comparatively practical. The authors provide a lower-bound analysis for convex/strongly convex and nonconvex (smooth) F, for first-order algorithms under biased oracles. The constructions for smooth and nonconvex F are based on prior work ([5] and [10]) and assume a particular algorithm class (zero-respecting algorithms) and bounded initialization. Although I am not an expert in lower-bound analysis, I checked the proofs for the lower-bound constructions and arguments, which were thorough and intricate. I believe it is especially important to evaluate the sample complexities of algorithms for CSO against lower bounds, to have a concrete reference point for comparison. The existence of a lower-bound analysis is significant. The authors discuss possible applications where the CSO problem arises and provide experiments under several of those scenarios. I will comment on the experiments in detail later, but the authors successfully present connections to the practical implications of the CSO setting, and hence of their algorithm.
Due to its relatively small inner sample complexity, BSGD has practical advantages.

Weaknesses: SCO has gained popularity recently, and CSO is a particular variation of it with concrete practical implications. Although the authors discuss their approach in conjunction with the most relevant prior work, the related work section is rather short and not detailed enough. The authors could have discussed individual papers (or groups of papers, based on their relevance to each other) to highlight the main differences and overlaps between the settings/assumptions and the algorithmic frameworks. The CSO setting itself is motivated in the introduction, but the paper lacks a narrative clarifying the differences between the relevant problem settings and the algorithms tailored to those particular problems. I believe this would illustrate why and how the CSO setting is important from a broader perspective. Second, the authors argue that the dual embedding and the primal-dual approach in [12] do not apply to nonconvex objectives; however, there are settings where BSGD and the algorithm in [12] overlap (when the problem is convex). As far as I could tell, the authors could have discussed the sample complexities for such cases. Similarly, they could have devised an experimental setup to compare both methods in practice. Overall, I do not have any major comments about general weaknesses of the paper. I have a few questions about technical details, for which the Additional comments section is more suitable.

==================== POST-REBUTTAL ====================

My major concern was the related work section, and the authors address it in their rebuttal. My other comments were mostly suggestions for clarification. I believe the lower-bound analysis, together with the sample complexity results, makes this a good paper for NeurIPS. I am increasing my score.

Correctness: I checked the proofs for the sample complexities in full detail and validated the consistency of the lower-bound analysis. I firmly believe that the sample complexity results, the theorems in the main text, and the lower-bound computations are correct. The worst-case sample complexity analysis is straightforward and makes use of classical approaches, so the proof steps appear correct. As for the empirical work, the authors provide a full description of the numerical setups in the main text and appendix. The paper is transparent about how each experiment was carried out. The experiments are devised to provide observations about the algorithm's behavior under different settings (smoothness of the outer function vs. mini-batch size trade-off), and they compare the method against existing methods in a fair manner. I have a few constructive comments on the numerical procedures in the Additional comments section.

Clarity: I find the paper well written; the structure is easy to follow, and the authors make their claims and contributions clear throughout the main text and appendix. One overarching comment on clarity: the authors sometimes do not define the random variable with respect to which an expectation is taken. Because of the compositional structure and the presence of conditional expectations, it is essential to make this clear; different definitions would imply different results. I will make detailed comments about this in the last section, although it appears to be a minor point. I appreciate the authors' effort in explaining the intricate details and implications of their main results, as well as the limitations of their methodology whenever necessary. The main results and contributions of the paper are highlighted and presented clearly.

Relation to Prior Work: The authors make their contributions clear and discuss how their results differ from and improve on the most relevant prior work (references [12] and [24]). As I discussed under weaknesses, the authors include a sufficient selection of related work in the paper, but they do not explain the differences between their method and the prior work, or the differences among the relevant literature. Although there is not a large amount of prior work on the CSO setting, it is essential to discuss the relationship to prior work in detail.

Reproducibility: Yes

### Review 2

Summary and Contributions: The paper describes an algorithm for solving "conditional" stochastic optimization (CSO) problems. The main claims concern the characterization of the rate of convergence of the presented algorithm (in terms of "sample complexity") under various smoothness and Lipschitz-continuity conditions on the functions involved.

Strengths: The optimization problem is interesting and worth studying for the NeurIPS community, especially as it pertains to MAML formulations. In addition, the authors bring up an important point that the degree of smoothness of the outer function in the composition can significantly alter the rate of convergence. However, significant problems exist in the discussion as I've outlined below.

Correctness: There are several apparent errors that need to be cleared up:

- The gradient formula for F in lines 101-102 needs further elucidation. Does g map R^d -> R^d, in which case it should have a Jacobian matrix rather than a gradient vector (usually denoted by \nabla)? Or does g map R^d -> R, in which case f is a function of a scalar and its derivative is more commonly written df/dx? Please explain.
- Note that you have specifically assumed that the inner expectation is conditional on the outer random variable. So, assuming that the gradient operator can be interchanged with both the outer and inner expectations, we get: \nabla F = E_{\xi} [ \nabla f ( E_{\nu|\xi} g(x) )^T E_{\nu|\xi} \nabla g(x) ]. In other words, you have a situation where \nabla F = E_{\xi} [ h_1(x, \xi)^T h_2(x, \xi) ], where the two functions h_1 and h_2 are possibly highly correlated because of the common conditional expectation. How do you go from this to the expression in lines 101-102, where the expectations are independent?
- Additionally, the role played by the sample size m in reducing the bias in the various theorems is unclear. The size m is the only available control on the bias of the gradient, so it is natural to assume that m must be made progressively larger to drive the bias to zero if convergence is to be assured at all. Why do some of your results (e.g. Thm 2) drive only the step size to zero and claim convergence?
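For concreteness, the chain-rule identity the review sketches can be written out as follows; this is a reviewer-side reconstruction under the interchange assumption, with g_\xi denoting the inner conditional expectation (notation assumed here, not taken from the paper under review):

```latex
% CSO objective, with g_xi(x) the inner conditional expectation:
F(x) = \mathbb{E}_{\xi}\!\left[ f\big(g_{\xi}(x)\big) \right],
\qquad g_{\xi}(x) := \mathbb{E}_{\nu\mid\xi}\!\left[ g(x,\nu) \right] \in \mathbb{R}^{k}.

% Chain rule, assuming gradients and expectations interchange;
% \nabla g_{\xi}(x) is the (k x d) Jacobian of g_{\xi}:
\nabla F(x) = \mathbb{E}_{\xi}\!\left[ \big(\nabla g_{\xi}(x)\big)^{\top}\,
              \nabla f\big(g_{\xi}(x)\big) \right],
\qquad \nabla g_{\xi}(x) = \mathbb{E}_{\nu\mid\xi}\!\left[ \nabla g(x,\nu) \right].
```

Both factors inside the outer expectation involve the same conditional expectation over \nu | \xi, so they are correlated, and a product of sample averages is a biased estimate of the product of conditional expectations; this is the correlation/bias concern raised above.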

Clarity: The English is perfectly clear, but I had a hard time with the concepts.

Relation to Prior Work: There is an entire stream of work on estimating conditional expectations for stochastic optimization, from the applied probability and mathematical finance communities, that the authors miss. This is a good representative reference: https://epubs.siam.org/doi/abs/10.1137/18M1173186

Reproducibility: Yes

### Review 3

Summary and Contributions: This paper proposes a biased stochastic gradient descent (BSGD) algorithm for conditional stochastic optimization (CSO). The sample complexity of BSGD, as well as a lower bound on the sample complexity of stochastic-oracle-based algorithms, is obtained in the strongly convex, convex, and weakly convex settings, and for outer functions that are either Lipschitz continuous or Lipschitz smooth. The sample complexities of BSGD in the general convex and smooth nonconvex settings are tight. Numerical experiments demonstrate its superior performance over some existing algorithms on invariant logistic regression, model-agnostic meta-learning, and instrumental variable regression.

Strengths: (1) The CSO problem studied in this paper is significant, since it covers important applications including MAML, planning and control in reinforcement learning, optimal control in linearly solvable Markov decision processes, and instrumental variable regression in causal inference. (2) The theoretical analysis is comprehensive: it covers a combination of six settings, namely whether the objective function is strongly convex, convex, or weakly convex, and whether the outer function is Lipschitz continuous or Lipschitz smooth. (3) To my knowledge, this paper provides the first lower bound on the sample complexity of stochastic-oracle-based algorithms for the CSO problem. (4) BSGD outperforms many relevant algorithms in the experiments.

Weaknesses: In Yang, Shuoguang, Mengdi Wang, and Ethan X. Fang, "Multilevel stochastic gradient methods for nested composition optimization," SIAM Journal on Optimization 29.1 (2019): 616-659, the authors propose the a-TSCGD algorithm, which achieves sample complexity O(eps^{-4.5}) in the case of two-level composition (the CSO problem with 2 levels) in the general nonconvex and Lipschitz-smooth setting, as shown in their Theorem 3.2; this is lower than the O(eps^{-6}) achieved by BSGD. The sample complexity in the convex and Lipschitz-smooth setting is O(eps^{-4}) in both papers. The sample complexity of a-TSCGD in the strongly convex and Lipschitz-smooth setting is O(eps^{-2.5}), which is worse than the O(eps^{-2}) in this paper. Hence, the statement that "We propose the FIRST stochastic gradient-based algorithm for solving CSO problems and establish the FIRST sample complexity results of the algorithm under convex and nonconvex settings" may not be true. I suggest the authors clarify this, discuss their results, and point out the technical differences.

In Table 1, the authors compare the sample complexities of BSGD and the SAA approach in [24]. However, in my understanding, these may have different meanings: the sample complexity of BSGD is the number of samples required for the BSGD algorithm to achieve an epsilon-minimizer of F, whereas the sample complexity of SAA is the number of samples required for the minimizer of the SAA problem to be an epsilon-minimizer of F. I suggest the authors clarify and discuss this point.

About reproducibility: for instrumental variable regression (IV), the approximation function h(w, X) is not specified.

Correctness: The algorithm is a correct generalization of SGD, and the theorems are well established. For the experimental implementation, I think it is valid as long as the same loss function and the same approximator function are used across the algorithms in each experiment.

Clarity: Yes, I think the structure and sentences of this paper are very clear.

Relation to Prior Work: Yes, this paper clearly discussed its improvement from previous works.

Reproducibility: Yes

Additional Feedback:
(2) There may be a typo in the final column of Table 1. Do you mean WC and Lipschitz?
(3) The minimax error definition in Section 4.1 is inf_A sup_{phi} sup_F (error), while it is sup_{phi} sup_F inf_A … in Section 4.2. Why do you swap sup_{phi} sup_F and inf_A?
(4) The caption of Figure 1: I think you mean $\sigma_a^2/\sigma_{\eta}^2$, yes?
(5) As I recall, SAA is a finite-sum objective rather than an algorithm, yes? How do you obtain the solution to the SAA problem in your experiment?

### Review 4

Summary and Contributions: The authors study conditional stochastic optimization, i.e. min_x F(x) = E_xi [ f_xi ( E_{eta|xi} [ f_eta(x, xi) ] ) ]. For this problem, it is not possible to obtain an unbiased estimate of the gradient nabla F. The authors propose to solve it using biased estimators of nabla F, sampling at every iteration one xi and m samples (eta_i) ~ eta|xi. Thanks to the law of large numbers, the bias is controlled by the batch size. The proposed algorithm is then simply SGD using this biased estimator. The authors adapt the proof of averaged SGD convergence (in the strongly convex/convex/weakly convex cases, Lipschitz/smooth) and find that the average bias of the gradient estimator appears directly in the bound. Similarly, they adapt lower-bound estimates of SGD complexity under classical oracle settings and find that the maximum bias appears directly as an additive term in the bounds. The authors assess the performance of their simple algorithm on three CSO tasks: synthetic invariant logistic regression, MAML on a simple task, and instrumental variable regression on a toy dataset.

----------- EDIT AFTER REBUTTAL: My concerns about novelty and about more general related work on stochastic approximation (which would yield the present upper bounds immediately) have not been addressed.
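To make the biased-estimator mechanism concrete, here is a minimal sketch on a toy CSO instance (the instance, distributions, and step-size choices are all illustrative, not taken from the paper): the inner conditional expectation is replaced by an m-sample average, which yields a gradient estimate whose bias shrinks as m grows.

```python
import numpy as np

# Toy CSO instance: F(x) = E_xi[ (E_{eta|xi}[eta] * x - xi)^2 ]
# with eta | xi ~ N(xi, 1), so E[eta|xi] = xi and F(x) = E[xi^2] * (x - 1)^2,
# minimized at x* = 1.  (Illustrative setup, not from the paper.)
rng = np.random.default_rng(0)

def bsgd(x0, n_iters=5000, m=10, lr=0.01):
    """SGD with a biased gradient: one outer sample xi, m inner samples eta_i."""
    x = x0
    for _ in range(n_iters):
        xi = rng.normal(2.0, 0.5)            # one outer sample
        eta = rng.normal(xi, 1.0, size=m)    # m inner samples from eta | xi
        inner = eta.mean()                   # plug-in estimate of E[eta | xi]
        # Chain-rule gradient with the plug-in estimate; biased because
        # E[inner^2] = xi^2 + 1/m, so the bias vanishes only as m grows.
        grad = 2.0 * (inner * x - xi) * inner
        x -= lr * grad
    return x

x_final = bsgd(x0=0.0)
print(f"x after BSGD: {x_final:.3f}")  # near the minimizer x* = 1, up to O(1/m) bias
```

With m = 1 the biased fixed point is pulled furthest from x* = 1 (the 1/m bias term is largest), while increasing m drives it toward x* = 1, mirroring the role the inner batch size plays in the bounds discussed above.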

Strengths: The paper is fairly well written (although some additional insights would be welcome, see below). It tackles an interesting problem, albeit a well-studied one. The lower bounds are not surprising, but their derivations are highly technical (in particular, they required understanding existing technical proofs). They form the essential contribution of this work.