NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
The paper is well organized and well written. First, an important problem is discussed; then theoretical bounds are given to address it, and a new algorithm is proposed based on these bounds. However, some questions need more explanation. (1) Why do you restrict the analysis to the hypothesis family of regression trees while stating that your results can be extended to other families? (2) Although the paper says the per-thread running time of RGB is comparable to that of standard GB, I am worried about the efficiency of RGB. It would be better if the running times of the experiments were reported and compared. (3) Only one algorithm, GB, is compared against RGB. If RGB were able to beat many other algorithms across different datasets, it would be considered more practical.
Reviewer 2
Originality: Incremental. Guided by the theoretical results, this paper proposes to find a function by randomized coordinate descent in each step of boosting.
Quality: The theoretical part and the proposed algorithm are sound. However, the proposed RGB is much slower than the original GB.
Clarity: The paper is well written. However, the theoretical part is not easy to follow.
Significance: Not clear, because the experiments use only small toy datasets (UCI data), with no results on large datasets.
Cons:
1. My main concern is the speed of RGB, since it searches multiple trees in each iteration (see the sketch after this review). I also do not think lines 271-273 are correct, because learning a single tree in GBDT can be parallelized, as in XGBoost and LightGBM; RGB will therefore be much slower than GB. The experiments should include a comparison of running time as well.
2. The datasets used in this paper are all small datasets from UCI. It would be better to include a comparison on large datasets, since most real-world tasks do not involve small data.
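To make Reviewer 2's cost concern concrete, below is a minimal, hypothetical sketch in Python contrasting one step of standard gradient boosting (a single tree fit to the residuals) with a randomized step that trains several candidate trees and keeps the best one. This is not the paper's RGB algorithm: the candidate count, random feature subsets, tree depth, shrinkage, and selection rule are illustrative assumptions only.

```python
# Illustrative sketch (not the paper's RGB): compares how many trees are
# trained per boosting iteration under squared loss. Candidate count,
# feature subsampling, depth, and shrinkage are arbitrary assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

def gb_step(F, X, y):
    """Standard GB step: fit one tree to the residuals y - F."""
    tree = DecisionTreeRegressor(max_depth=3).fit(X, y - F)
    return F + 0.1 * tree.predict(X), 1          # one tree trained

def randomized_step(F, X, y, n_candidates=5):
    """Randomized step: fit several candidate trees on random feature
    subsets and keep the one with the largest loss decrease."""
    best_pred, best_loss, trained = None, np.inf, 0
    for _ in range(n_candidates):
        cols = rng.choice(X.shape[1], size=3, replace=False)
        tree = DecisionTreeRegressor(max_depth=3).fit(X[:, cols], y - F)
        pred = tree.predict(X[:, cols])
        loss = np.mean((y - (F + 0.1 * pred)) ** 2)
        trained += 1
        if loss < best_loss:
            best_pred, best_loss = pred, loss
    return F + 0.1 * best_pred, trained          # n_candidates trees trained

F = np.zeros_like(y)
F, gb_trees = gb_step(F, X, y)
F, rnd_trees = randomized_step(F, X, y)
print(f"trees trained this iteration: GB={gb_trees}, randomized={rnd_trees}")
```

Under these assumptions the randomized step trains n_candidates trees where standard GB trains one, which is exactly the per-iteration gap Reviewer 2 asks the authors to quantify with wall-clock timings.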
Reviewer 3
Gradient boosting (GB) has been extensively studied in the past, both theoretically and experimentally. Recently, with the advent of big data, several accelerated versions of vanilla GB have been proposed (in particular the well-known XGBoost), and while experimental evaluations of these methods have been abundant, the same cannot be said for their theoretical analysis. In this paper, the authors tackle this important problem.

The main contribution of this paper consists in casting the various accelerated GB methods in a regularized gradient boosting setting. Indeed, by introducing a regularization term in the usual minimization objective of GB (a generic form of such an objective is sketched after this review), it is possible to recover most, if not all, of the various accelerated gradient boosting approaches (XGBoost included), while at the same time opening up several interesting and exciting possibilities for deriving novel acceleration strategies. The second contribution is a full theoretical analysis of the generalization error of regularized gradient boosting using the Rademacher complexity of the hypothesis space. To the best of my knowledge, this is the first bound on the generalization error custom-tailored for accelerated gradient boosting approaches. The third contribution comes in the form of a theoretically justified learning algorithm, coined RGB. The authors make perfect use of the theoretical guarantees of randomized coordinate descent in order to justify and derive the novel method RGB.

The paper is also very well written and easy to follow. The main contributions are clearly presented. As far as I could tell, the theoretical justifications are correct.

My main remarks concern the experimental section. While I agree that the experimental results are clearly encouraging, as they show a certain advantage of RGB over XGBoost, I am not quite sure I clearly understand the authors' claim in lines 307-311. In particular, I do not think that such bold conclusions can be inferred from the results presented in the paper, unless I am missing something. Also, in line 303 the authors claim that RGB fails only on diabetes, yet the results of Table 1 show that RGB fails on mnist49 as well, which implies that the hypothesis given in lines 304-306 is not realistic.

All in all, a very interesting contribution, proposing a theoretical view of accelerated (regularized) gradient boosting.
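For reference, a generic regularized boosting objective of the kind Reviewer 3 describes can be sketched as follows; the particular loss $\ell$, regularizer $\Omega$, and trade-off parameter $\lambda$ below are illustrative assumptions, not necessarily the paper's exact formulation:

$$ \min_{\alpha \in \mathbb{R}^N} \; \frac{1}{m} \sum_{i=1}^{m} \ell\Big( y_i, \; \sum_{j=1}^{N} \alpha_j h_j(x_i) \Big) \; + \; \lambda \, \Omega(\alpha) $$

Here $h_1, \dots, h_N$ are the base hypotheses (e.g., regression trees), $\alpha$ their ensemble weights, and $m$ the number of training samples; different choices of $\Omega$ (e.g., $\|\alpha\|_1$ or $\|\alpha\|_2^2$) recover different regularized or accelerated boosting variants, XGBoost-style penalties included.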