NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1204 Blended Matching Pursuit

### Reviewer 1

EDIT I was pleased with the authors answer about the duality gap. I encourage them to include this paragraph in their revision. I hope the public implementation of the code will be released for reproducibility. I have updated my grade from 6 to 7. ###################"" The article reads quite well, with numerous examples and explanations of convex optimization concepts. The methodology is strongly inspired from blended conditional gradient. The experimental validation is dubious to me as for loops are usually costly in python. The authors should at least use numba and @njit to use just in time compilation. How are the entries of X sampled ? If the design is near to orthogonal (eg if the entries are iid centered gaussian), then the problem is very easy and the figure may not reflect reality. Why not use real life datasets, eg from LIBSVM, and simulate only x^* and w (the observations y in these datasets could also be used)? I disagree with the sentence L15. Sparsity of the **solution** is important, but sparsity of the **iterates** does not affect generalization nor interpretation. Getting the sparsest solution is also important in the case when there are multiple ones, but that is not clear in this sentence. L35: which algorithm needs RIP? Usually RIP is needed to analyze the quality of the solution to Pb (1), not to analyze the convergence of an algorithm. Fact 2.1: please provide a reference. L83 I think some hypothesis is missing, f = indicator of {a} is strongly convex, but nabla f (x*) does not exist. The quantity \phi_t was not detailed enough in my opinion. How is it a dual gap estimate ? What is the reason for dividing it by tau at every dual step, and why are dual step coined like this? Finally, the code cannot be run on my machine: 363 364 # if still nothing --> 365 return _, 'FN', _ 366 367 def weak_sep(c, x, phi, kappa, S, D): NameError: name '_' is not defined And it seems that dual steps are never taken before: In [1]: %run bmp_code.py ********** η = 100 ********** Total CPU process time: 30.582956643000003 Number of constrained steps: 73 Number of dual steps: 0 Number of full steps: 39 ********** η = 10 ********** Total CPU process time: 30.691314241999997 Number of constrained steps: 41 Number of dual steps: 0 Number of full steps: 81 ********** η = 5 ********** Very cosmetic: L97 you have two different spelling for Lojasiewicz Algo 1 L5 L6 the authors may want to define \DeclareMathOperator{\argmin}{\mathrm{arg\,min}} to avoid the space appearing between arg and min because of the long text under. L52 can't it just be \lambda_j ? The notation H* L66/ R^n* L102 is confusing and I suggest the authors find a better one.

### Reviewer 2

=== After the author's rebuttal, I decide to keep my score. The paper proposes a blended matching pursuit (BMP) algorithm, which combines weak-separation oracle, gradient descent and gradient matching steps to solve a smooth convex minimization problem restricted on a linear space spanned by a dictionary. Under various orders of smoothness and sharpness of the objective function, sublinear or linear convergence rates can be established for the proposed algorithm. Numerical experiments on synthetic data sets are conducted to show that BMP can achieve fast convergence and make solutions sparse. Overall, the work is interesting and important in applications with complete convergence discussion and supporting numerical results. However, the proposed algorithm seems the adaptation of the work by Braun et al. [2019] from a convex hull to a spanning linear space of a specified dictionary, which limits its novelty. In addition, computational experiments are insufficient to justify the claimed fast convergence, which requires more in-depth discussions, e.g., the influence of the matrix size $A$, robustness to the noise, and sensitivity of sparsity levels. Other comments are shown as follows: 1. In the abstract, it claims that coordinate descent-like steps and stronger gradient descent steps which are unclear in the analysis of Algorithm 2. 2. In Problem (1), the domain for the variable $x$ should be changed to $span(\mathcal{D})$ according to the context. 3. In Lemma 2.3., the notation $(R^n)^*$ is a little confusing and could be replaced by another symbol. Likewise, $H^*$ could be confused with the dual space of $H$. 4. In line 231-232, is there any evidence to justify this statement? 5. In Section 3.1, strongly convex cases are skipped in convergence analysis, which should be commented at least. 6. Minor typos: In line 25, "lazified"? In line 40, "reader" -> "readers". In line 163, "Similarly" -> "Similar".

### Reviewer 3

The main idea and the motivations are explained clearly. The experimental illustrations are convincing too. Some clarifications may help understanding the paper better. The dual gaps are not well defined within the main paper, which are being used in BMP to choose the steps. It makes more significant to discuss them in the main paper to appreciate the contrast better w.r.t. the referred paper (Braun et. al., 2019).