Paper ID: 1211 Title: A Boosting Framework on Grounds of Online Learning
Current Reviews

Submitted by Assigned_Reviewer_10

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary:
This paper proposes boosting algorithms and analyzes them from an online learning perspective. First, the authors propose a boosting algorithm based on the online mirror descent update (MABoost). Then they present a smooth version of MABoost, i.e., a variant of MABoost that creates only "smooth" distributions over the examples. Further, they propose sparse and lazy versions of MABoost and prove their convergence. In particular, their results also imply an improved convergence bound for MadaBoost.

The convergence proof for MadaBoost is new and improves on the original version.

But the analysis based on online learning (or using Bregman divergences) is not new. For example, for TotalBoost (ICML 2006) and SoftBoost (NIPS 2007), Warmuth et al. already used such techniques to prove the convergence of the algorithms.

The authors consider minimizing the training error only, which is not an appropriate goal. Obviously, the goal should be to minimize the generalization error. Some smooth boosting algorithms can be applied in the boosting-by-filtering framework, in which the learner has access to an example oracle returning a labeled example drawn randomly according to the underlying distribution D, and the goal is to directly minimize the generalization error w.r.t. D. See, e.g., the MadaBoost, GiniBoost, or FilterBoost papers for details. But the framework of the current paper does not seem to apply to the filtering setting.

An alternative way to minimize the generalization error is to prove that the boosting algorithm optimizes the (soft) margin. TotalBoost, SoftBoost, and ERLPBoost (Warmuth et al., ALT 2008) do this. The proposed methods, however, do not follow this approach either.

In Theorem 1, the authors compare two versions of MABoost, for the two-norm and for the relative entropy. Apparently, the bounds differ exponentially. But the distribution created using the relative entropy is much more skewed in general, and thus the edges of the hypotheses become smaller in practice. The authors should mention this, too.

Also, an experimental comparison is desirable.

- There are some typos, such as the unresolved reference "Appendix ??".

I think the paper is acceptable if the authors add a discussion of the generalization ability of the proposed methods in the revised version.
The analysis is interesting and it motivates new boosting algorithms, but some explanation or discussion of the generalization ability should be added.

Submitted by Assigned_Reviewer_30

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a new, quite general boosting algorithm based on mirror descent/ascent, together with the well-established connection between boosting and online learning. This result is then applied to various theoretical problems such as the derivation of a "smooth" boosting algorithm and an analysis of the MadaBoost algorithm.

As the paper acknowledges, the connection between boosting and online learning is not at all new. Nevertheless, the paper exploits this connection in novel ways that lead to interesting and significant new results. The analysis and algorithm are clean and general, and the applications very interesting. This seems like a significant advance in the development of general boosting-based methodology.

I think Theorem 1 is quite interesting. Even so, I am dubious about interpreting it to mean that this general algorithm is a boosting algorithm for the case given in Eq. (8). The bound does show that the training error epsilon goes to zero under the weak learning condition as T gets large. However, it appears to me that this bound is too weak to prove that the generalization error goes to zero, as required for strong PAC learning.

Related to this point, I would be curious to know what can be proved in general about the margins of the training examples for MABoost. If the margins can be shown to be large, then this would suffice to show that the algorithm is indeed a boosting algorithm.

The paper is entirely theoretical with no experimental results.

I very much liked the summary of the state of boosting research given in the introduction.

I think central concepts like smooth boosting should be described in the text, not in footnotes.

Page 2: I think another algorithmic framework for deriving boosting algorithms is given by the "drifting games" approach.

Line 105: I think the result in [13] only applies to a variant of LogitBoost.

Line 170: Should "x+i" be x_i?

Line 398: I'm not sure I understand the conclusion that MadaBoost is slower than previously hoped. My understanding is that the new result (which is very nice) only proves an upper bound for this algorithm, not a lower bound. So isn't it still possible that a better upper bound could eventually be proved?

Line 415 (and elsewhere): "conjunction" -> "conjecture"

In general, the paper is badly in need of further proofreading, for instance, for numerous English mistakes and errors in the latex.
A significant advance in the development and theoretical understanding of boosting-based methodology.

Submitted by Assigned_Reviewer_36

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This work describes an approach for obtaining boosting algorithms from the mirror-descent algorithm for (online) convex minimization. The derived algorithms have provable guarantees in the PAC framework (that is, given a weak learner with advantage \gamma, they produce an \epsilon-accurate hypothesis). The framework is useful when boosting needs to be performed under additional (convex) constraints on the distribution, since those can be enforced by changing the Bregman projection step. In particular, smoothness and sparsity are analyzed. The paper also claims to contribute a new analysis of mirror descent with lazy updates.
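For concreteness, my reading of the generic scheme can be sketched as follows. This is an illustrative sketch only: the function names, the step-size rule, and the one-pass capped projection are my own placeholders, not the paper's exact algorithm.

```python
import numpy as np

def maboost_sketch(weak_learner, X, y, T=10, eps=0.1, eta=0.5):
    """Illustrative mirror-descent boosting loop with an entropy mirror map.

    `weak_learner(X, y, w)` should return a vector of {-1, +1} predictions
    under the current distribution w. All names and the step-size rule are
    placeholders, not the paper's exact algorithm.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                 # distribution over training examples
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)           # weak hypothesis as a prediction vector
        edge = float(np.dot(w, y * h))      # weighted edge gamma_t of h under w
        alpha = eta * edge                  # illustrative step size
        # Mirror-descent step under the relative-entropy divergence:
        # a multiplicative update followed by renormalization.
        w = w * np.exp(-alpha * y * h)
        w /= w.sum()
        # Approximate Bregman projection onto "smooth" distributions:
        # cap each weight at 1/(eps*n) and renormalize (one pass only; an
        # exact projection would iterate until no weight exceeds the cap).
        w = np.minimum(w, 1.0 / (eps * n))
        w /= w.sum()
        hypotheses.append(h)
        alphas.append(alpha)
    votes = sum(a * h for a, h in zip(alphas, hypotheses))
    return np.sign(votes)                   # final combined hypothesis
```

Changing the projection step (or the divergence) is what yields the different variants discussed in the paper.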

This paper is a nice generalization of the Kivinen-Warmuth 99 and Barak-Hardt-Kale 09 papers, which dealt only with the KL divergence, to arbitrary Bregman divergences. The applications are nice, but none of them seems both interesting and novel. Smoothness is certainly a useful property (e.g., for agnostic learning), but in this case the approach to obtaining smoothness is identical to that of the BHK09 paper. In the case of sparsity, the discussion is rather non-specific: no clear motivation and no theoretical analysis of sparsity are given. The result only shows that it is possible to add an additional l_1 constraint into the boosting analysis. Without motivation and analysis it is impossible to tell, for example, how it would compare with subsampling.

At the technical level the results seem nice but rely primarily on the well-known connection of boosting and online learning/optimization and known analysis of mirror-descent. The paper mentions a number of "second-order" improvements to aspects of the analysis but I'm not familiar enough with those to judge the importance and novelty. Overall I think that the paper has interesting technical content which should be published. At the same time I expect more novelty (either conceptually or technically) for a solid recommendation to NIPS.

053 - The term "smooth boosting" was first defined by Servedio (COLT 2001), not Gavinsky (though smoothness was used even earlier by Jackson, FOCS 1994)
095 - The bound on the number of steps lacks the \gamma_t term
098 - Smoothness is also shown to be useful for agnostic boosting by Gavinsky (ALT 2002), and by Feldman (ICS 2010) for the agnostic and PAC learning models.
138 - It seems that later on you are using a different definition of entropy
156 - is -> are
169 - "norm of"
170 - typos in the definition of entropy
171 - what are x_i' here?
279 - 3 typos
415, 419 - "conjunction" -> "conjecture"
441 - M.Long is not an author of this article (P.Long is the editor)
This paper presents a technique for obtaining boosting algorithms based on a generalization of some previous approaches. The results appear to be new and are useful to know but are technically fairly standard.
Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We appreciate your encouraging and useful comments. The corrections you mentioned will be taken into account in the final version of the paper.

As mentioned, MABoost is presented in the boosting-by-sampling setting rather than in the boosting-by-filtering setting because MABoost is more general: the filtering setting only accepts a subset of smooth boosters, while MABoost, as a framework, can generate both smooth and non-smooth algorithms. In fact, some of the smooth boosters derived from the MABoost framework can be used in the filtering setting. For instance, using the sum of binary entropies as the Bregman divergence in the lazy version of MABoost results in a LogitBoost-type smooth boosting algorithm. Following the techniques presented in MadaBoost [Domingo2000], it can be straightforwardly converted to a boosting-by-filtering algorithm.

Another important point (though not directly mentioned in the paper) is that MABoost maximizes the minimum margin over the examples; in other words, it is a maximum-margin framework. Thus, theoretically, it minimizes an upper bound on the generalization error. Admittedly, a line or two are needed to mention this property of MABoost, and it is quite easy to show. In the proof of Theorem 1, select w_* to be a vector with only one non-zero entry, corresponding to the example with the worst margin (minimum margin over the examples). Then in inequality (13) the first term on the left is the margin and, as can be seen with our choice of eta (eta_t = gamma_t/L), it converges to lambda_min (the minimum edge). From [Rätsch & Warmuth, 2005, "Efficient..."] (or the first equation in TotalBoost) we know that lambda_min is the maximum margin that can be achieved. Thus, MABoost is a maximum-margin algorithm.
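Schematically, and with illustrative notation rather than the paper's exact symbols in inequality (13), the argument reads:

```latex
\sum_{t=1}^{T} \eta_t \langle w_*, d_t \rangle
\;\ge\; \sum_{t=1}^{T} \eta_t \langle w_t, d_t \rangle
\;-\; D_R(w_*, w_1)
\;-\; \frac{1}{2\beta} \sum_{t=1}^{T} \eta_t^2 \, \| d_t \|^2,
\qquad d_{t,i} = y_i\, h_t(x_i).
```

With w_* = e_{i^*}, the indicator of the worst-margin example, the left-hand side divided by \sum_t \eta_t is exactly the margin of example i^*, while \langle w_t, d_t \rangle = \gamma_t is the edge at round t. With \eta_t = \gamma_t / L, the divergence and second-order terms vanish relative to \sum_t \eta_t, so the minimum margin is asymptotically lower-bounded by the weighted average edge and converges to \lambda_{\min}.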

Finally, it is noteworthy that the analysis and the corresponding techniques used here, such as choosing an appropriate w_*, beta-strongly convex functions, and double Bregman projections, are new in the boosting context and have not been used before in TotalBoost or other works. This is the first work that rigorously shows that famous online learning algorithms are in fact boosting algorithms.