Review for NeurIPS paper: Walking in the Shadow: A New Perspective on Descent Directions for Constrained Minimization

NeurIPS 2020

Walking in the Shadow: A New Perspective on Descent Directions for Constrained Minimization

Review 1

Summary and Contributions: The paper provides a new perspective on different descent directions used in Frank-Wolfe algorithm by introducing the best local feasible direction of descent, named as shadow step. This direction is locally the steepest feasible direction, which has rich connection to existing descent directions such as the standard FW step, away step and pairwise step. A new algorithm is provided based on the shadow step, achieving linear convergence rate for strongly convex constrained problems on polytopes.

Strengths: The paper is well organized and provides new perspective and understanding on existing descent direction of Frank-Wolfe algorithm. In particular, the Frank Wolfe vertex can be viewed as a special case of following the locally feasible steepest descent direction, when the face minimizing the gradient is degenerate to one vertex. This suggests that following the shadow step along the projection trajectory can be beneficial, which turns out to enable linear convergence result while the standard Frank-Wolfe step can not. Comparison with away step or pairwise step are also included, illustrating intuitions to different choice of descent direction. Hence, the paper is theoretically solid and insightful.

Weaknesses: My major concern is regarding the computational complexity of the oracle of the shadow step. Empirically, it seems like the evaluation of this oracle is more expensive compared to standard descent directions, is it possible to provide a theoretical analysis regarding the required complexity of evaluating different descent directions? If it is more computational expensive, it might reduce its impact in practice. It may be more preferable to reduce the frequency of evaluating the shadow directions, as in Shadow-CG, each iteration still need to evaluate at least one shadow direction. Moreover, as mentioned in Theorem 4, the FW step can be viewed as the infinity projection, meaning that taking directly the FW step can be computational more efficient in some cases, without evaluating the intermediate break point and shadow directions. This raises the question about when the shadow step is more desirable than the FW step. As far as I understand, shadow step provides local information while FW step provides some "global" information. Unless the iterate already lies in the face of the optimum, in which case shadow step can help reducing the zigzag, I do not see a clear reason why the shadow step is more preferable. Besides the above mentioned comparison, I find the claim regarding geometric constant independent linear convergence rate a bit over-claiming since the number of breakpoints could be exponential in the dimension which is worse than the polynomial dependence in pyramidal width, please comment on it. Overall I am very positive about the paper because of its new perspective and understanding.

Correctness: I have quickly gone through the proof, the analysis is sound to the best of my knowledge.

Clarity: The paper is very well written and easy to follow.

Relation to Prior Work: A through and rich discussion on the related work is included.

Reproducibility: Yes

Additional Feedback: == EDIT after author's response == The author's response provides detailed explanation and clarification on my concern. Even though not perfect, I believe the paper provides a new perspective and understanding on existing descent direction of Frank-Wolfe algorithm and I will maintain my score supporting publication of the paper.

Review 2

Summary and Contributions: This paper studies the choice of descent direction of each iteration for constrained minimization problems. This paper proposes to combine the ideas of FW steps and shadow steps into a novel SHADOW-CG method. The analysis shows the shadow of the gradient is the best local direction of descent and the proposed method converges linearly under certain conditions.

Strengths: The presentation of the paper is smooth. The intuition of shadow steps is well explained. The theoretical properties of the proposed method have been carefully studied.

Weaknesses: The organization of the paper is not well-balanced which may due to the limitation of space. The main method was introduced near the end of the paper and did not have enough space to fully discuss it. Besides, the paper did not provide sufficient numerical experiments to demonstrate the advantage of the proposed method. The numerical comparison results in Figure 2 seems to suggest the Shadow CG method converges with a small number of iterations but takes a longer wall clock time to converge compared with some existing methods.

Correctness: The proposed method as well as the theoretical analysis looks solid to me, though I did not have enough time to carefully check the proofs in the supplemental file. This paper only briefly presented a numerical experiment for a real dataset that does not provide strong evidence to support the empirical performance of the proposed method.

Clarity: In general, the paper reads smoothly. I suggest the authors to re-organize the structure of the paper since preliminaries and preparation took too much of the space while the main methodology was not fully developed and discussed in the main domucent.

Relation to Prior Work: The introduction discussed the relation of this paper with existing methods. More numerical comparisons with existing method are encouraged.

Reproducibility: No

Additional Feedback: ######## EDIT after author's response ############# I have carefully read the rebuttal letter. The authors' responses have addressed many of my concerns. I would like to keep my positive score and support for acceptance. ######## Original response ############# Please see my comments above.

Review 3

Summary and Contributions: The paper offers a new framework to understand PGD, Frank-wolf step, away-steps in a unified way by looking at their relationship with the descent direction defined in a limit way which they call shadow of the gradient. The connection is interesting, intuitive and natural. The Shadow-CG method they design using the insights has good linear convergence while also enjoying good performance in practice as they show through experiments.

Strengths: The theoretical part of the paper is rather sound. The empirical evaluation works on real-world data which demonstrates the effectiveness of method in a solid way. The attempt of a unified understanding in the CGD area might further the design of conditional gradient methods for constrained optimization problems.

Weaknesses: I will appreciate the work more if it addresses the following of my concerns: + explain why the assumption of a trace oracle / shadow oracle is natural and in practice feasible; also why the iteration cost in Trace step would not dominate + give a more direct comparison with prior algorithms in terms of theoretical guarantees in the main paper. In particular, for what kind of problems would the method be in more favor? + the design of Shadow-CG has used a combination of heuristic and claims in prior work. The proofs prior to the design of algorithms seem more to be explaining how to understand different steps under the new framework, instead of motivating the design of algorithms. In that sense the flow of paper seems less coherent. + the reason why Shadow-CG is superior over Shadow-Walk has only been explained in a hand-wavy sense - is there more theoretical reasons available. Also, in experiments the author corroborated this point but the runtimes of algorithm in the plot don't seem to suggest the same, right? Update: The authors feedback has addressed most of my concerns. My only comment then will to be writing the paper in a clearer way to avoid confusion when making those comparisons.

Correctness: The method and proofs look correct. The empirical methodology serves well to corroborate the theoretical claims.

Clarity: I have enjoyed reading the paper. Lots of proofs have been omitted in the main parts but there have been enough words to connect parts and motivate the next steps so that it is easy to follow. The only thing I think might be helpful is to include the main theorems (Theorem 6) and Theorem 8 that shows the theoretical guarantee of the main algorithm as proposed.

Relation to Prior Work: Enough results from prior work has been adequately quoted in the paper, following by a detailed description of the different yet relevant methods and their theoretical guarantees. I would change my view on the paper if the authors are able to: + give some concrete applications when their complexity is better than all of the prior (extending the argument on line 311-313) theoretically or + show how the framework can be used to unify and more elegantly derive a subset of the prior CG / FW results they have mentioned.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: In this work, the authors provide a framework to understand the different movement directions (Frank-Wolfe vertices, away steps, projected gradient steps) in the constrained optimization setting. The authors shows that the best direction of descent is the directional derivative of the projection of the gradient which is equivalent to projected gradient descent in continuous time. Further, they show that the Frank-Wolfe vertices are in fact the projection of an infinite descent in the negative gradient direction. In the end, authors also provide a SHADOW-CG method combining FW steps and shadow steps which achieve linear convergence for strongly convex objective, independent of pyramidal width.

Strengths: A very interesting paper indeed. The paper provides an interesting insights to view different descent direction with the lens of shadow of the gradient as proposed in the paper. To me, the result sounds very interesting and novel. The main claims in the paper are: 1. Curve correpsonding to the projected gradient update as a function of step size at a particular x is a piecewise linear curve and the break point occurs where normal cone changes. 2. Norm of the shadow step of the gradient is zero only at the optimal point and can be used to bound the primal gap without any dependence on geometric constants. The authors show that the single projected gradient descent step can be approximated by multiple shadow step which essentially allow the algorithm to skip the projected gradient step. In fact, the authors proposed an approach to trace the whole piecewise linear curve g(\lambda). 3. The authors provide connections between shadow of the gradient descent directions in conditional gradient variants. In lemma 3, they prove that shadow of the gradient is the best local feasible direction of descent which is intuitive. They also show that the end point of the curve g(\lambda) is in fact the FW vertex. Though it seems a nice connection, it is unclear that what does it mean intuitively. 3. In the later section of the paper, the authors prove the convergence result for the proposed shadow CG algorithm which uses the result from the previous sections. The linear rate of convergence is achieved with no geometric constants.

Weaknesses: The paper is hard to read. It took me multiple read to understand the paper. An explanation after each theoretical results would make this paper more readable and useful in my opinion. Then, there are few other minor comments: 1. What does it mean (Line 247) - "FW vertices are therefore able to wrap around the polytope maximally compared to any projected gradient method and serve as an anchor point in the projections curve." 2. Though the rate obtained for shadow CG algorithm does not have dependence on geometric constants like pyramidal width, but depends on the number of facets. This dependence in number of facets I believe is because of using the Trace algorithm (algorithm 3). As shown in Theorem 6, number of facets is exponential in m. However, this exponential dependence doesn't show up in the experiments but I believe there must be cases where this would really blow up. Similarly, the pyramidal width I believe is different than the number of facets and there would be cases where the geometry is not skewed and FW would converge faster in practice. Now, since the shadow-CG algorithm uses the FW steps in the beginning and Trace step only kicks in when the FW-direction oscillates a lot between atoms and the inner product between the gradient and FW direction becomes small. So I am wondering that would the algorithm always take the best of two worlds in practice ? I meant if the geometry is nice then it is not a nice idea to go for trace tracking step even once. Though line 5 of the algorithm seem to check that but is there a more safe way to check ? 3. Similar to the previous point if the algorithm goes to trace tracking step every time then I guess this increases the computation of each update by 2^m in the worst case. Can there be an efficient way to avoid this computation on average. ? 4. Connection between shadow walk and shadow cg algorithm is not clear. Though they seem to have the same rate of convergence, the shadow walk algorithm seem to perform slower in practice. What could be the reason behind it ? ------------------------------------------------------------------------------------------- Thanks for the response. I would like to keep my score unchanged and would love to see this paper published.

Correctness: Seems correct.

Clarity: The writing of the paper can be improved. In its current state the paper is hard to read while reading for the first time.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: