Paper ID: 5785

Title: High-Dimensional Optimization in Adaptive Random Subspaces

Post-rebuttal update: The authors' rebuttal addresses my (minor) concerns well, and my overall score remains the same.

This paper presents a principled randomized optimization method for high-dimensional convex optimization via a data-dependent random linear sketch. The approach is similar to earlier work that also solves sketched versions of convex optimization problems, such as:

- M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.
- Y. Yang, M. Pilanci, M. J. Wainwright, et al. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.

The main innovations here are to extend this sketching technique to a wider class of convex objectives and to introduce a data-adaptive sketching technique that greatly improves the error bounds on the solution relative to a data-oblivious sketch. The proposed technique can also be applied iteratively to improve the accuracy of the solution without changing the sketch matrix, so the sketch of the data only has to be computed once.

Overall, I thought this was a high-quality paper. The proposed sketching approach has clear practical benefits and comes with strong theoretical guarantees. The adaptive sketch is particularly appealing, has clear benefits over an oblivious sketch, and may be of broader interest to the optimization community. The paper is generally well-written and clear. I only have some minor comments, given below.

I think the title should be changed to “High-Dimensional Convex Optimization in Adaptive Random Subspaces” (i.e., add the word “Convex”) to better reflect the contents of the paper. Namely, the approach hinges crucially on convex duality, so it seems limited in applicability to convex optimization problems.
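To make the adaptive-vs-oblivious distinction concrete, here is a minimal numerical sketch (my own illustration for ridge regression, not the paper's exact algorithm): the oblivious sketch restricts the solution to a random m-dimensional subspace, while the adaptive sketch restricts it to range(A^T S^T), which contains the exact solution whenever the data have low effective rank.

```python
import numpy as np

# Illustration (mine, not the paper's algorithm): compare an oblivious
# Gaussian sketch with a data-adaptive sketch for ridge regression,
#   min_x 0.5*||A x - b||^2 + 0.5*lam*||x||^2,
# restricting x to an m-dimensional subspace range(St) in each case.
rng = np.random.default_rng(0)
n, d, r, m, lam = 100, 200, 5, 20, 1e-2

# Data matrix of exact rank r <= m, so its row space is low-dimensional.
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
b = rng.standard_normal(n)

# Unsketched reference solution.
x_full = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def sketched_ridge(A, b, St, lam):
    """Minimize the ridge objective over x = St @ alpha (x restricted to range(St))."""
    M = A @ St
    G = M.T @ M + lam * (St.T @ St)              # may be rank-deficient, so use lstsq
    alpha = np.linalg.lstsq(G, M.T @ b, rcond=None)[0]
    return St @ alpha

# Oblivious sketch: a random m-dimensional subspace of R^d.
St_obl = rng.standard_normal((d, m)) / np.sqrt(m)
# Adaptive sketch: the subspace range(A^T S^T), spanned by sketched rows of A.
S = rng.standard_normal((m, n)) / np.sqrt(m)
St_ada = A.T @ S.T

err_obl = np.linalg.norm(sketched_ridge(A, b, St_obl, lam) - x_full) / np.linalg.norm(x_full)
err_ada = np.linalg.norm(sketched_ridge(A, b, St_ada, lam) - x_full) / np.linalg.norm(x_full)
# Since rank(A) = r <= m, the adaptive subspace contains x_full exactly,
# so err_ada sits at machine precision while err_obl is of order one.
```

This is the low-rank extreme; for data with decaying (rather than exactly vanishing) eigenvalues the adaptive error is no longer zero, but the same mechanism explains why it beats the oblivious sketch.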
One limitation is that the optimization model (1) studied in this work assumes the loss function f is strongly smooth, which rules out L1 losses, among others. Also, the model (1) can only accommodate L2 regularization of x, and not other common types of non-smooth regularization used in regression settings, e.g., L1 in the lasso or L1+L2 in the elastic net. Some discussion of whether the present approach could be extended to this more general case would be nice. I see there is a Section E in the appendix that partly addresses this, but that section does not appear to be discussed at all in the main body.

At the start of Section 4, the paper states “By the representer theorem, the primal program (1) can be re-formulated as…”, but I do not think invoking the representer theorem here is necessary. If I understand correctly, the equation in (11) is just due to a linear change of variables x = A^T w.

The results on classification of CIFAR10 in Table 3 are far from state-of-the-art (~50% classification error). Is there a different problem setting beyond classification for which the method could be demonstrated?

The notation 5.10^{-5} is a little non-standard. Maybe use a \cdot or \times rather than a period?
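The change-of-variables reading is easy to check on a concrete instance. The snippet below (my own illustration using ridge regression, not taken from the paper) verifies that substituting x = A^T w and solving in w recovers the primal solution exactly, which follows from the push-through identity (A^T A + λI)^{-1} A^T = A^T (A A^T + λI)^{-1} rather than from the representer theorem.

```python
import numpy as np

# Illustration (mine, ridge regression): the reformulation x = A^T w yields
# the same solution as the primal problem, via the push-through identity
#   (A^T A + lam*I)^{-1} A^T = A^T (A A^T + lam*I)^{-1}.
rng = np.random.default_rng(1)
n, d, lam = 50, 120, 0.1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Primal solution in R^d.
x_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Solve in the n-dimensional variable w, then map back through x = A^T w.
w = np.linalg.solve(A @ A.T + lam * np.eye(n), b)
x_dual = A.T @ w
# x_primal and x_dual agree to machine precision.
```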

This paper proposes a new randomized optimization method for solving high-dimensional problems. The paper is well written, and the authors demonstrate the efficacy of their methods with strong theoretical guarantees and extensive experiments to back up their claims. The authors use standard results from Fenchel duality on the dual problem and its adaptively sketched counterpart to bound the relative error with high probability. The relative error bounds are shown for high-dimensional matrices with finite rank and with exponentially and polynomially decaying eigenvalues. The adaptive sketching method significantly outperforms its oblivious counterpart for different decay values.

The sketched matrix is computed exactly once and achieves a better condition number with probability 1. However, computing it involves a matrix inversion step that requires m^3 computations. Is there a way to approximate it for problems that might need large m? Moreover, would the approximation still satisfy Proposition 3 up to a certain multiplicative factor? Conversely, for small m, would dynamically modifying the sketching matrix lead to even tighter bounds?

While the experiments indicate that sketch-based methods indeed converge quickly to their optimal solutions, it is unclear whether SGD (without sketching) converges to the same solution or performs better (albeit after a longer time). It would be interesting to compare test accuracy vs. wall-clock time on MNIST and CIFAR10 against oblivious sketching and other methods. The authors claim that the sketch method did not work well with SGD; did the authors experiment with relatively large mini-batch sizes, which may resolve the issue of not approaching the minimizer accurately?

Post rebuttal: In light of the rebuttal posted by the authors, I would like to keep my current score.

Update after rebuttal: Thanks for the clarifications on SGD and other baselines. Please make sure to add these experimental results in future versions of the paper.

Most of my comments are on the empirical results:

1) Experiments in Section 5: While the experimental results in Section 5 suggest that the proposed technique is fast compared to direct optimization of the original objective, it is not clear how much it improves over baselines such as oblivious Gaussian sketching. Since classification accuracy is used as the error metric (and not parameter recovery error), it is possible that oblivious sketching performs similarly to adaptive sketching. So the authors should compare their approach with other sketching baselines and the Nystrom method. The authors also point out that SGD didn't work well for adaptive sketching. This sounds slightly concerning because, for really high-dimensional problems, where one has to use a large sketching dimension, Newton's method would be slower than SGD. So it would be great if the authors performed more experiments on the robustness of other baselines and of the proposed method to the choice of optimizer. What value of lambda is used for the synthetic experiments in Figure 1? It looks like lambda = 0 is used. If so, logistic regression on separable data would not have a minimizer (the norm of its iterates diverges), so shouldn't the relative error keep diverging as well?

2) While the theoretical results look interesting, it is not clear why this particular form of adaptive sketching works. So it would be great if the authors provided some intuition for why the proposed approach works.

3) Clarity: I feel the organization of the paper could be slightly improved. Some less important parts of the paper (e.g., Remark 1) could be pushed to the appendix, and more discussion could be added to Section 4.
It would also be good if the authors provided some discussion of how the proposed method compares with other techniques for accelerating kernel methods, such as random Fourier features and Nystrom methods.
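On the lambda = 0 concern in comment 1), the divergence is easy to verify numerically. The toy example below (my own, not from the paper) runs gradient descent on unregularized logistic regression with linearly separable data; the iterate norm grows without bound instead of converging, so a relative-error metric against a "minimizer" would be ill-defined.

```python
import numpy as np

# Toy example (mine): with lambda = 0 and linearly separable data, logistic
# regression has no finite minimizer, so the norm of gradient-descent
# iterates diverges (roughly logarithmically in the iteration count).
X = np.array([[1.0], [2.0], [-1.0], [-2.0]])   # separable 1-D data
y = np.array([1.0, 1.0, -1.0, -1.0])

def grad(w):
    # Gradient of the mean logistic loss (1/n) * sum_i log(1 + exp(-y_i x_i^T w)).
    z = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(z)))) / len(y)

w = np.zeros(1)
norms = {}
for t in range(1, 2001):
    w = w - 1.0 * grad(w)
    if t in (100, 2000):
        norms[t] = float(np.linalg.norm(w))
# norms[2000] > norms[100]: the iterates keep drifting outward rather
# than settling at a finite minimizer.
```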