Submitted by
Assigned_Reviewer_4
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper studies the idea of using a boosting-like
approach to directly minimize the training error and various functions of
the training margins. The algorithms are explained in detail, and decent
experimental results are presented. Theory is fairly minimal.
In a
way, this is a fairly obvious idea, but conventional wisdom says that the
idea should not work since classification error is not convex or even
continuous. It is great to see someone try it, and to spell out all the
issues involved, and the details of how to implement it. I thought the
experimental results were especially strong, a nice comparison on 10
datasets against several boosting algorithms. The new method works
surprisingly well, especially in the presence of noise.
The
presentation is generally good. I thought the descriptions of algorithms
could have been a bit more precise, but the paper gives examples that
really help to illustrate what is going on.
Q2: Please summarize your review in 1-2 sentences
A good idea presented with algorithmic details, and
strong experimental results.
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
In this paper the authors provide an algorithm for
directly minimizing the 0-1 loss and maximizing the margin. Most existing machine
learning techniques have relied on minimizing a convex upper bound on the
0-1 loss in classification problems. In contrast, in this paper the
authors propose a simple greedy algorithm for directly minimizing the 0-1
loss via a combination of weak learners. This is followed by a few steps
of direct maximization of margin. The proposed algorithm is then evaluated
on a few small, low-dimensional datasets.
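To make the two-phase idea concrete, here is a minimal sketch of the first phase: greedy coordinate descent on the 0-1 loss over a finite pool of weak learners. The pool H, the candidate weight grid, and the stopping rule are simplified placeholders, not the authors' exact algorithm.

import numpy as np

def zero_one_loss(F, y):
    # fraction of examples whose functional margin y * F(x) is non-positive
    return np.mean(y * F <= 0)

def greedy_01_descent(H, y, max_rounds=100,
                      weight_grid=np.linspace(0.1, 2.0, 20)):
    # H: (num_weak_learners, n) array of weak-learner predictions in {-1, +1}
    F = np.zeros(H.shape[1])      # current ensemble score for each example
    ensemble = []                 # chosen (weak learner index, weight) pairs
    for _ in range(max_rounds):
        current = zero_one_loss(F, y)
        # the coordinate step (learner, weight) that most reduces the 0-1 loss
        loss, j, a = min((zero_one_loss(F + a * h, y), j, a)
                         for j, h in enumerate(H) for a in weight_grid)
        if loss >= current:       # no coordinate improves: coordinatewise minimum
            break
        F += a * H[j]
        ensemble.append((j, a))
    return F, ensemble

The second phase, adding weak learners to grow a margin objective while keeping the 0-1 loss at its minimum, would follow the same greedy pattern.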
I have the following
major concerns about this paper:
The authors claim that their
algorithm has a much more favorable run-time than AdaBoost (Table 2). I
do not understand how this is possible. In AdaBoost, in each round, a
weight is given to a training example and a weak learner such as a
decision tree can be found quite efficiently to minimize a weighted loss
(for example using CART as the weak learner). Thus, in each step of
boosting one needs to find only one decision tree. However, for the
proposed algorithm, one has to iterate over all possible weak learners. I
just don't see how the proposed algorithm can be computationally more
efficient unless the number of weak learners is really small. In many
applications with large datasets it is typical to consider decision trees
of depth 5 or 10 and I do not see how the proposed method can be efficient
in that case if one has to enumerate all weak learners. I am concerned
that the proposed algorithm is limited to small, low-dimensional datasets
and to weak-learner classes with very few members.
From the
standard deviations in Table 1, I am not sure that the proposed method's
results are statistically significantly better than AdaBoost's. If we
take the standard error, most of the results seem to overlap.
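To illustrate the standard-error point, a small sketch with made-up fold accuracies (not values from Table 1): the standard error is the fold standard deviation divided by the square root of the number of folds, and two methods whose mean plus/minus standard error intervals overlap are hard to tell apart.

import numpy as np

def mean_and_stderr(fold_scores):
    scores = np.asarray(fold_scores, dtype=float)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))

ada = [0.86, 0.84, 0.88, 0.85, 0.87]      # hypothetical AdaBoost fold accuracies
direct = [0.87, 0.86, 0.89, 0.85, 0.88]   # hypothetical DirectBoost fold accuracies
(m1, se1), (m2, se2) = mean_and_stderr(ada), mean_and_stderr(direct)
overlap = (m1 - se1) <= (m2 + se2) and (m2 - se2) <= (m1 + se1)
print(f"AdaBoost {m1:.3f}+/-{se1:.3f}, DirectBoost {m2:.3f}+/-{se2:.3f}, "
      f"overlap: {overlap}")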
The proposed method consists of two steps: first, 0-1
loss is minimized using greedy coordinate descent. Once a few weak
learners are selected, a few more weak learners are added to maximize the
average margin of bottom n' examples greedily. How much of the claimed
benefit is due to each step? For example, what happens if you run your
Algorithm 2 after a few iterations of AdaBoost? "Direct 0-1 loss
minimization" is a bit of a misnomer since it is followed up with a few
steps of margin maximization. I am not sure such natural questions are
adequately answered in this paper.
Q2: Please
summarize your review in 1-2 sentences
While this work could potentially be interesting, I am
not completely convinced about this paper at this point.
Submitted by
Assigned_Reviewer_8
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a boosting method that directly
minimizes the empirical classification error defined by the indicator of
whether the predicted y equals the observed y, the so-called 0-1 loss. The
proposed method first runs greedy coordinate descent to search for a
coordinatewise minimum of the loss, and then runs coordinate ascent to
expand the margin. The method is interesting and novel, and offers some
advantages.
The following concerns are raised.
1. It seems the authors assume
a hypothesis space with a finite number of hypotheses, because Algorithms 1
and 2 loop over all hypotheses h at each iteration. For instance, Algorithm 1
finds all weak learners that lead to the largest classification error
reduction at each iteration. What if users choose an infinite hypothesis
space, such as linear functions?
2. At the end of
Algorithm 1, what is the rationale for updating the weight of the one weak
learner that gives the smallest exponential loss? Isn’t it still using a
convex loss function, even though the earlier part of the algorithm uses the
0-1 loss? (See the sketch after this list of concerns.)
3. It also assumes that the data is always separable as long
as the combined strong classifier is strong. What if that is not the case?
What if the chosen hypothesis space is not complex enough to separate a
given dataset? The paper states that Algorithm 1 reaches a coordinatewise
minimum. It also says the 0-1 loss reaches 0. What is the definition of a
coordinatewise minimum? Is it actually a global minimum, since the lowest
possible error would be 0?
4. When the second step
(margin maximization) starts, it stays within a region that does not take
the solution out of the zero-loss region, characterized by the value of d.
It is not clear how exactly this value of d is calculated; the paper simply
says “determine the lowest sample whose margin is decreasing”.
5. Given that separable cases are assumed, the margin is always
positive in their formulation. What happens if there are negative margins
for inseparable cases? The second part of the algorithm would not run.
6. An early paragraph on page 2 says “it might escape the
region of minimum training error in order to achieve a larger margin”. In
the later section, however, the design of Algorithm 2 does not allow the
margin maximization step to go beyond the zero 0-1 loss region.
7. Assuming a
weak learner can be obtained at a similar computational cost, DirectBoost
is certainly computationally heavier per iteration than AdaBoost.
In their experiments, what stopping criterion did they use for
AdaBoost? Why does AdaBoost need so many iterations?
8. It is not
clear what the truly convincing advantage of the proposed method
over regular boosting methods is.
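Regarding concern 2 above, here is a rough illustration of the tie-breaking as the rebuttal later explains it: among candidate weights that achieve the same minimal 0-1 loss, the one with the smallest exponential loss is picked, so the convex loss only decides among ties. The candidate grid is a placeholder, not the authors' exact interval search.

import numpy as np

def pick_weight(F, h, y, candidates):
    # F: current ensemble scores, h: weak-learner outputs in {-1, +1}, y: labels
    zero_one = lambda a: np.mean(y * (F + a * h) <= 0)
    exp_loss = lambda a: np.mean(np.exp(-y * (F + a * h)))
    best_01 = min(zero_one(a) for a in candidates)
    tied = [a for a in candidates if zero_one(a) == best_01]
    # the exponential loss only breaks the tie; the objective is still 0-1 loss
    return min(tied, key=exp_loss)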
Q2: Please summarize
your review in 1-2 sentences
This paper presents a boosting method that directly
minimizes the empirical classification error defined by the indicator of
whether the predicted y equals the observed y, the so-called 0-1 loss. The
method is interesting and novel, and offers some advantages, although there
are some concerns.
Submitted by
Assigned_Reviewer_13
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Simple idea with nice results. If this is
effectively new, then the paper should be accepted. Therefore the
novelty aspect should be carefully checked.
Is it really a
boosting algorithm? To me it should be called a pursuit algorithm because
it does not apparently rely on manipulating the distribution of examples.
This is important because there are closely related works (see kernel
matching pursuit,
http://www.iro.umontreal.ca/~vincentp/Publications/kmp_techreport2000.pdf,
section 3) found in the pursuit literature instead of the boosting
literature. This is close, but not exactly the same, though...
About the overlap with papers 539 and 956. I was initially a
reviewer of paper 956 which describes a multiclass variant of the same
idea. Paper 956 is harder to understand and could have been a paragraph in
paper 481. Paper 539 is a semi-supervised variant of the same idea and
could have been a paragraph in paper 481. This splitting of a work into
minimal publishable units is unwise. I think that the community will be
better served by accepting 481, rejecting 539 and 956, and letting the
author know that he can refer to these extensions in the final version of
the NIPS paper and defer the full discussion to a future journal paper.
Detailed comments:
[page 1, line 44] -- Even with
surrogate losses, the problem is not always convex. This depends on the
family of functions.
Q2: Please summarize your review in 1-2 sentences
Recommend accepting only 481 which contains the
important idea.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
For Reviewer 13: Thank you for your review. You
are correct that our method doesn’t maintain a distribution of samples, so
it is similar to a matching pursuit algorithm in this sense. However,
according to Hastie's book [1], page 605, "... Boosting was initially
proposed as a committee method as well, although unlike random forests,
the committee of weak learners evolves over time, and the members cast a
weighted vote.” Clearly our method is a boosting method in this sense. We
haven't yet had a chance to show that our method is a boosting algorithm in
the PAC learning sense, but the term "boosting" has come to be used for a
range of algorithms, and many boosting algorithms do not possess the
PAC-boosting property. Thank you for your comments about related papers 956
and 539. You are right that surrogate losses do not always yield a convex
problem.
For Reviewer 4: Thank you for the valuable feedback. We do
have theoretical results on a generalization error bound in
terms of the average margin of the bottom n′ samples or the n′th order
margin of the training examples. We will add the statement: "Instead, we
can prove that the generalization error of any ensemble classifier is
bounded in terms of the average margin of the bottom n′ samples or the n′th
order margin of the training examples, as well as the number of training
examples and the complexity of the base classifiers."
For
Reviewer 7: Thank you for your comments and concerns. Due to space
limitations, we were only able to give very brief descriptions of building
trees in lines 309-313 in the paper. Similar to other boosting methods,
DirectBoost uses a greedy top-down partition strategy in each step to
recursively find a single tree; it does not need to enumerate all trees.
For each splitting node, our algorithm looks for a split that leads to
a better 0-1 loss or margin objective. When we use binary splits, our
algorithm has the same computational complexity as AdaBoost with CART as the
weak learning algorithm. When we use multiway splits, our algorithm has
higher computational complexity, but as pointed out by Hastie et al. on page
311 of [1], binary splits are preferred nonetheless. Therefore we used binary
splits for all boosting algorithms in our experiments. As can be seen in Table 2,
for each iteration, DirectBoost is slightly slower than AdaBoost. On
lines 340 and 346, we describe the stopping criteria for AdaBoost and
DirectBoost, respectively. Table 2 shows the running times. Though
DirectBoost is slower in each iteration, it requires far fewer iterations
and is therefore more efficient than AdaBoost overall. In Table 1,
comparing DirectBoost^epsilon_avg to AdaBoost, many of the results
overlap if we consider the standard error across the folds. However,
DirectBoost^epsilon_avg gives better results, on average, in 9 out of 10
data sets, which shows the effectiveness of our methods. Algorithm 1’s
objective is to minimize the 0-1 loss, and Algorithm 2 looks for a model with
an optimal margin objective under the condition of minimum 0-1 loss in a
certain sense. While Algorithm 1 could be replaced by AdaBoost, the
problem with AdaBoost is that when the data is not linearly separable, it
does not have a good stopping criterion. Algorithm 1 stops at a
coordinatewise local minimum. “Direct” in our title is an adjective
not only for “0-1 loss minimization” but also for “margin maximization.”
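A rough sketch of this greedy split search, restricted to a depth-1 stump for brevity: scan feature/threshold pairs and score each candidate split by the ensemble's resulting 0-1 loss. The weight grid and the stump restriction are simplifications, not the exact top-down recursion used in the paper.

import numpy as np

def best_stump_for_01_loss(X, y, F, weight_grid=(0.25, 0.5, 1.0)):
    # X: (n, p) feature matrix, y: labels in {-1, +1}, F: current ensemble scores
    n, p = X.shape
    best_loss, best_split = np.inf, None
    for j in range(p):
        for t in np.unique(X[:, j]):
            h = np.where(X[:, j] <= t, 1.0, -1.0)   # candidate binary split
            for a in weight_grid:
                loss = np.mean(y * (F + a * h) <= 0)
                if loss < best_loss:
                    best_loss, best_split = loss, (j, t, a)
    return best_loss, best_split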
For Reviewer 8: Thank you for your comments and questions. In
this paper, we only consider the cases, such as decision trees, where the
hypothesis space is finite. We will consider extending to the case of an
infinite hypothesis space in future work. When there are
several weak learners with the same 0-1 loss, we choose the one with the
smallest exponential loss; when we identify an interval with the same 0-1
loss, we can choose any point in the interval, such as the midpoint.
However, we found empirically that choosing the point with the minimum
exponential loss in the interval leads to better results. The
definition of coordinatewise local minimum can be found on page 479 in
[2]. When the data is not separable, Algorithm 1 reaches a coordinatewise
local minimum, and then Algorithm 2 searches for a model with a larger
margin objective. In this case, some samples have negative margins.
Suppose the 0-1 loss is 5%; then Algorithm 2 starts there, and the
interval [0, d) is the one within which the training error stays <= 5%. We
use epsilon-relaxation to escape a region of zero (or locally minimal) 0-1
loss; this leads to a value of alpha_t that is epsilon greater than d. As
noted in the response to Reviewer 7, DirectBoost is slightly slower than
AdaBoost per iteration but requires far fewer iterations (the stopping
criteria are described on lines 340 and 346; Table 2 shows the running
times), and it is therefore more efficient than AdaBoost overall. One of the advantages
of our boosting method might be its robustness to label noise, as
described in Section 3.2.
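A small sketch of the step-size computation and epsilon-relaxation described here (and asked about in Reviewer 8's concern 4): d is taken as the largest step along the chosen weak learner before some currently-positive, shrinking margin hits zero, and the chosen weight steps epsilon past it. Both the way d is obtained and the fixed epsilon are simplified assumptions, not the paper's exact procedure.

import numpy as np

def epsilon_relaxed_step(F, h, y, epsilon=1e-3):
    # F: current ensemble scores, h: weak-learner outputs in {-1, +1}, y: labels
    margins = y * F
    shrinking = (y * h) < 0                 # margins that shrink as the step grows
    critical = shrinking & (margins > 0)    # correct examples that could flip
    if critical.any():
        # step size at which the first such margin reaches zero
        d = np.min(margins[critical] / -(y * h)[critical])
        return d + epsilon                  # epsilon-relaxation: step just past d
    return epsilon                          # nothing constrains the step in this toy case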
[1] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, 2nd Edition, 2009.
[2] P. Tseng. Convergence of a block coordinate descent method for
nondifferentiable minimization. Journal of Optimization Theory and
Applications, 475-494, 2001.