Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Ligen Wang, Balázs Kégl
In this paper we propose to combine two powerful ideas, boosting and manifold learning. On the one hand, we improve ADABOOST by incor- porating knowledge on the structure of the data into base classifier design and selection. On the other hand, we use ADABOOST's efficient learn- ing mechanism to significantly improve supervised and semi-supervised algorithms proposed in the context of manifold learning. Beside the spe- cific manifold-based penalization, the resulting algorithm also accommo- dates the boosting of a large family of regularized learning algorithms.
1 Introduction
ADABOOST [1] is one of the machine learning algorithms that have revolutionized pattern recognition technology in the last decade. The algorithm constructs a weighted linear com- bination of simple base classifiers in an iterative fashion. One of the remarkable properties of ADABOOST is that it is relatively immune to overfitting even after the training error has been driven to zero. However, it is now a common knowledge that ADABOOST can overfit if it is run long enough. The phenomenon is particularly pronounced on noisy data, so most of the effort to regularize ADABOOST has been devoted to make it tolerant to outliers by either "softening" the exponential cost function (e.g., [2]) or by explicitly detecting outliers and limiting their influence on the final classifier [3].
In this paper we propose a different approach based on complexity regularization. Rather than focusing on possibly noisy data points, we attempt to achieve regularization by fa- voring base classifiers that are smooth in a certain sense. The situation that motivated the algorithm is not when the data is noisy, rather when it has a certain structure that is ignored by ordinary ADABOOST. Consider, for example, the case when the data set is em- bedded in a high-dimensional space but concentrated around a low dimensional manifold (Figure 1(a)). ADABOOST will compare base classifiers based on solely their weighted errors so, implicitly, it will consider every base classifier having the same (usually low) complexity. On the other hand, intuitively, we may hope to achieve better generalization if we prefer base classifiers that "cut through" sparse regions to base classifiers that cut into "natural" clusters or cut the manifold several times. To formalize this intuition, we use the graph Laplacian regularizer proposed in connection to manifold learning [4] and spectral clustering [5] (Section 3). For binary base classifiers, this penalty is proportional to the number of edges of the neighborhood graph that the classifier cuts (Figure 1(b)).
(a) (b)
Figure 1: (a) Given the data, the vertical stump has a lower "effective" complexity than the horizontal stump. (b) The graph Laplacian penalty is proportional to the number of separated neighbors.
To incorporate this adaptive penalization of base classifiers into ADABOOST, we will turn to the marginal ADABOOST algorithm [6] also known as arc-gv [7]. This algorithm can be interpreted as ADABOOST with an L1 weight decay on the base classifier coefficients with a weight decay coefficient . The algorithm has been used to maximize the hard margin on the data [7, 6] and also for regularization [3]. The coefficient is adaptive in all these applications: in [7] and [6] it depends on the hard margin and the weighted error, respectively, whereas in [3] it is different for every training point and it quantifies the "noisiness" of the points. The idea of this paper is to make dependent on the individual base classifiers, in particular, to set to the regularization penalty of the base classifier. First, with this choice, the objective of base learning becomes standard regularized error minimization so the proposed algorithm accommodates the boosting of a large family of regularized learning algorithms. Second, the coefficients of the base classifiers are lowered proportionally with their complexity, which can be interpreted as an adaptive weight decay. The formulation can be also justified by theoretical arguments which are sketched after the formal description of the algorithm in Section 2.
Experimental results (Section 4) show that the regularized algorithm can improve general- ization. Even when the improvement is not significant, the difference between the training error and the test error decreases significantly and the final classifier is much sparser than ADABOOST's solution, both of which indicate reduced overfitting. Since the Laplacian penalty can be computed without knowing the labels, the algorithm can also be used for semi-supervised learning. Experiments in this context show that algorithm besignificantly the semi-supervised algorithm proposed in [4].
2 The REGBOOST algorithm
For the formal description, let the training data be Dn = (x1, y1), . . . , (xn, yn) where data points (xi, yi) are from the set Rd {-1, 1}. The algorithm maintains a weight distri- bution w(t) = w(t) 1 , . . . , w(t) n over the data points. The weights are initialized uniformly in line 1 (Figure 2), and are updated in each iteration in line 10. We suppose that we are given a base learner algorithm BASE Dn, w, P () that, in each iteration t, returns a base classifier h(t) coming from a subset of H = h : Rd {-1, 1} . In ADABOOST, the goal of the base classifier is to minimize the weighted error
n
= (t)(h) = w(t)I {h(x i i) = yi} , 12 i=1
1The indicator function I{A} is 1 if its argument A is true and 0 otherwise. 2We will omit the iteration index (t) and the argument (h) where it does not cause confusion.
REGBOOST Dn, BASE(, , ), P (), , T
1 w (1/n, . . . , 1/n)
2 for t 1 to T
3 h(t) BASE Dn, w(t), P ()
n
4 (t) w(t)h(t)(x i i)yi edge i=1
5 (t) 2P (h(t)) edge offset
1 1 + (t) 1 - (t) 6 (t) ln base coefficient 2 1 - (t) 1 + (t)
7 if (t) 0 base error (1 - (t))/2
8 return f (t-1)() = t-1 (j)h(j)() j=1
9 for i 1 to n
exp -(t)h(t)(xi)yi 10 w(t+1) w(t) i i n w(t) exp - (t)h(t)(x j=1 j j )yj
11 return f (T )() = T (t)h(t)() t=1
Figure 2: The pseudocode of the REGBOOST algorithm with binary base classifiers. Dn is the training data, BASE is the base learner, P is the penalty functional, is the penalty coefficient, and T is the number of iterations.
which is equivalent to maximizing the edge = 1 - 2 = n w(t)h(x i=1 i i)yi. The goal of REGBOOST's base learner is to minimize the penalized cost
1 1 R1(h) = (h) + P (h) = - ( - ), (1) 2 2
where P : H R is an arbitrary penalty functional or regularization operator, provided to REGBOOST and to the base learner, is the penalty coefficient, and = 2P (h) is the edge offset. Intuitively, the edge quantifies by how much h is better than a random guess, while the edge offset indicates by how much h(t) must be better than a random guess. This means that for complex base classifiers (with large penalties), we require a better base classification than for simple classifiers. The main advantage of R1 is that it has the form of conventional regularized error minimization, so it accommodates the boosting of all learning algorithms that minimize an error functional of this form (e.g., neural networks with weight decay). However, the minimization of R1 is suboptimal from boosting's point of view.3 If computationally possible, the base learner should minimize
1 - 1+ 1- 1 + 1+ 1 - 1- R2(h) = 2 = . (2) 1 + 1 - 1 + 1 -
3This statement along with the formulae for R1, R2, and (t) are explained formally after Theo- rem 1.
After computing the edge and the edge offset in lines 4 and 5, the algorithm sets the coef- ficient (t) of the base classifier h(t) to 1 1 + (t) 1 1 + (t) (t) = ln - ln . (3) 2 1 - (t) 2 1 - (t)
In line 11, the algorithm returns the weighted average of the base classifiers f (T )() = T (t)h(t)() as the combined classifier, and uses the sign of f (T )(x) to classify x. t=1 The algorithm must terminate if (t) 0 which is equivalent to (t) (t) and to (t) (1-(t))/2.4 In this case, the algorithm returns the actual combined classifier in line 8. This means that either the capacity of the set of base classifiers is too small ((t) is small), or the penalty is too high ((t) is high), so we cannot find a new base classifier that would improve the combined classifier. Note that the algorithm is formally equivalent to ADABOOST if (t) 0 and to marginal ADABOOST if (t) is constant.
For the analysis of the algorithm, we first define the unnormalized margin achieved by f (T ) on (xi, yi) as i = f (T )(xi)yi, and the (normalized) margin as T (t)h(t)(x i t=1 i)yi i = = , (4) T 1 (t) t=1 where 1 = T (t) is the L t=1 1 norm of the coefficient vector. Let the average penalty or margin offset be defined as the average edge offset T (t)(t) = t=1 . (5) T (t) t=1 The following theorem upper bounds the marginal training error n 1 L()(f (T )) = I n i < (6) i=1
achieved by the combined classifier f (T ) that REGBOOST outputs.
Theorem 1 Let (t) = 2P (h(t)), let and L()(f (T )) be as defined in (5) and (6), re- spectively. Let w(t) be the weight of training point (x i i, yi) after the tth iteration (updated in line 10 in Figure 2), and let (t) be the weight of the base regressor h(t)() (computed in line 6 in Figure 2). Then T n T
L()(f (T )) e(t)(t) w(t)e-(t)h(t)(xi)yi = E(t) (t), h(t) . (7) i t=1 i=1 t=1
Proof. The proof is an extension of the proof of Theorem 5 in [8]. n T T 1 L()(f (T )) = I (t) - (t)h(t)(x n i)yi 0 (8) i=1 t=1 t=1 n 1 e PT (t)-PT (t)h(t)( t=1 t=1 xi)yi (9) n i=1 T n n = e PT (t) t=1 w(t)e-(t)h(t)(xj)yj w(T +1). (10) j i t=1 j=1 i=1
4Strictly speaking, (t) = 0 could be allowed but in this case the (t) would remain 0 forever so it makes no sense to continue.
In (8) we used the definitions (6) and (4), the inequality (9) holds since ex I{x 0}, and we obtained (10) by recursively applying line 10 in Figure 2. The theorem follows by the definition (5) and since n w(T +1) = 1. i=1 i
First note that Theorem 1 explains the base objectives (1) and (2) and the base coefficient (3). The goal of REGBOOST is the greedy minimization of the exponential bound in (7), that is, in each iteration we attempt to minimize E(t) (, h). Given h(t), E(t) , h(t) is minimized by (3), and with this choice for (t), R2(h) = E(t) (t), h , so the base learner should attempt to minimize R2(h). If this is computationally impossible, we follow Mason et al.'s functional gradient descent approach [2], that is, we find h(t) by maximizing the
negative gradient - E(t)(,h) in = 0. Since - E(t)(,h) = - , this criterion is =0 equivalent to the minimization of R1(h).5
Theorem 1 also suggests various interpretations of REGBOOST which indicate why it would indeed achieve regularization. First, by (9) it can be seen that REGBOOST directly minimizes n 1 exp - n i + 1 , i=1 which can be interpreted as an exponential cost on the unnormalized margin with an L1 weight decay. The weight decay coefficient is proportional to the average complexity of the base classifiers. Second, Theorem 1 also indicates that REGBOOST indirectly min- imizes the marginal error L()(f (T )) (6) where the margin parameter , again, is moving adaptively with the average complexity of the base classifiers. This explanation is sup- ported by theoretical results that bound the generalization error in terms of the marginal error (e.g., Theorem 2 in [8]). The third explanation is based on results that show that the difference between the marginal error and the generalization error can be upper bounded in terms of the complexity of the base classifier class H (e.g., Theorem 4 in [9]). By imposing a non-zero penalty on the base classifiers, we can reduce the pool of admissible functions to those of which the edge is larger than the edge offset . Although the theoretical results do not apply directly, they support the empirical evidence (Section 4) that indicate that the reduction of the pool of admissible base classifiers and the sparsity of the combined classifier play an important role in decreasing the generalization error.
Finally note that the algorithm can be easily extended to real-valued base classifiers along the lines of [10] and to regression by using the algorithm proposed in [11]. If base clas- sifiers come from the set {h : Rd R}, we can only use the base objective R1(h) (1), and the analytical solution (3) for the base coefficients (t) must be replaced by a simple numerical minimization (line search) of E(t) , h(t) .6 In the case of regression, the bi- nary cost function I {h(x) = y} should be replaced by an appropriate regression cost (e.g., quadratic), and the final regressor should be the weighted median of the base regressors instead of their weighted average.
3 The graph Laplacian regularizer
The algorithm can be used with any regularized base learner that optimizes a penalized cost of the form (1). In this paper we apply a smoothness functional based on the graph
5Note that if is constant (ADABOOST or marginal ADABOOST), the minimization of R1(h) and R2(h) leads to the same solution, namely, to the base classifier that minimizes the weighted error . This is no more the case if depends on h. 6As a side remark, note that applying a non-zero (even constant) penalty would provide an alternative solution to the singularity problem ((t) = ) in the abstaining base classifier model of [10].
Laplacian operator, proposed in a similar context by [4]. The advantage of this penalty is that it is relatively simple to compute for enumerable base classifiers (e.g., decision stumps or decision trees) and that it suits applications where the data exhibits a low dimensional manifold structure.
Formally, let G = (V, E) be the neighborhood graph of the training set where the vertex set V = {x1, . . . , xn} is identical to the set of observations, and the edge set E contains pairs of "neighboring" vertices (xi, xj) such that either xi - xj < r or xi (xj) is among the k nearest neighbors of xj (xi) where r or k is fixed. This graph plays a crucial role in several recently developed dimensionality reduction methods since it approximates the natural topology of the data if it is confined to a low-dimensional smooth manifold in the embedding space. To penalize base classifiers that cut through dense regions, we use the smoothness functional
n n 1 2 PL(h) = h(x W 2| i) - h(xj ) ij , W| i=1 j=i+1
where W is the adjacency matrix of G, that is, Wij = I (xi, xj) E , and 2|W| = 2 n n W i=1 j=1 ij is a normalizing factor so that 0 PL(h) 1.7 For binary base classifiers, PL(h) is proportional to the number of separated neighbors, that is, the number of connected pairs that are classified differently by h. Let the diagonal matrix D defined by Dii = n W j=1 ij , and let L = D - W be the graph Laplacian of G. Then it is easy to see that
n
2|W|PL(h) = hLhT = h, Lh = i h, ei , j=1
where h = h(x1), . . . , h(xn) , and ei and i are the (normalized) eigenvectors and eigen- values of L, that is, Lei = iei, ei = 1. Since L is positive definite, all the eigenvalues are non-negative. The eigenvectors with the smallest eigenvalues can be considered as the "smoothest" functions on the neighborhood graph. Based on this observation, [4] proposed to learn a linear combination of a small number of the eigenvectors with the smallest eigen- values. One problem of this approach is that the out-of-sample extension of the obtained classifier is non-trivial since the base functions are only known at the data points that par- ticipated in forming the neighborhood graph, so it can only be used in a semi-supervised settings (when unlabeled test points are known before the learning). Our approach is based on the same intuition, but instead of looking for a linear combination of the eigenvectors, we form a linear combination of known base functions and penalize them according to their smoothness on the underlying manifold. So, beside semi-supervised learning (explored in Section 4), our algorithm can also be used to classify out-of-sample test observations.
The penalty functional can also be justified from the point of view of spectral clustering [5]. The eigenvectors of L with the smallest eigenvalues8 represent "natural" clusters in the data set, so PL(h) is small if h is aligned with these eigenvectors, and PL(h) is large if h splits the corresponding clusters.
7Another variant (that we did not explore in this paper) is to weight edges decreasingly with their lengths. 8Starting from the second smallest; the smallest is 0 and it corresponds to the constant func- tion. Also note that spectral clustering usually uses the eigenvectors of the normalized Laplacian e L = D-1/2LD-1/2. Nevertheless, if the neighborhood graph is constructed by connecting a fixed number of nearest neighbors, Dii is approximately constant, so the eigenvectors of L and e L are approximately equal.
4 Experiments
In this section we present experimental results on four UCI benchmark datasets. The re- sults are preliminary in the sense that we only validated the penalty coefficient , and did not optimize the number of neighbors (set to k = 8) and the weighting scheme of the edges of the neighborhood graph (Wij = 0 or 1). We used decision stumps as base classifiers, 10-fold cross validation for estimating errors, and 5-fold cross validation for determining . The results (Figure 3(a)-(d) and Table 1) show that the REGBOOST consistently improves generalization. Although the improvement is within the standard deviation, the difference between the test and the training error decreases significantly in two of the four experi- ments, which indicates reduced overfitting. The final classifier is also significantly sparser after 1000 iterations (last two columns of Table 1). To measure how the penalty affects the base classifier pool, in each iteration we calculated the number of admissible base classi- fiers relative to the total number of stumps considered by ADABOOST. Figure 3(e) shows that, as expected, REGBOOST traverses only a (sometimes quite small) subset of the base classifier space.
(a) (b) (c) ionosphere breast cancer sonar
0.25 0.09 0.6 training error (AdaBoost) training error (AdaBoost) training error (AdaBoost) test error (AdaBoost) test error (AdaBoost) test error (AdaBoost) training error (RegBoost) 0.08 training error (RegBoost) training error (RegBoost) test error (RegBoost) test error (RegBoost) 0.5 test error (RegBoost) 0.2 0.07
0.06 0.4
0.15 0.05 0.3 0.04 0.1
0.03 0.2
0.02 0.05 0.1 0.01
0 0 0 1 10 100 1000 1 10 100 1000 1 10 100 1000
t t t
pima indians diabetes rate of admissible stumps semi-supervised ionosphere
0.35 1 0.2 training error (AdaBoost) ionosphere training error (AdaBoost) test error (AdaBoost) breast cancer test error (AdaBoost) training error (RegBoost) 0.9 sonar 0.18 training error (RegBoost) test error (RegBoost) pima indians diabetes test error (RegBoost) 0.3 0.16 0.8
0.14 0.7
0.25 0.12 0.6 0.1 0.5 0.2 0.08
0.4 0.06
0.3 0.15 0.04
0.2 0.02
0.1 0.1 0 1 10 100 1000 1 10 100 1000 1 10 100 1000
t t t (d) (e) (f) Figure 3: Learning curves. Test and training errors for the (a) ionosphere, (b) breast cancer, (c) sonar, and (d) Pima Indians diabetes data sets. (e) Rate of admissible stumps. (f) Test and training errors for the ionosphere data set with 100 labeled and 251 unlabeled data points.
data set training error test error # of stumps ADAB REGB ADAB REGB ADAB REGB ionosphere 0% 0% 9.14% (7.1) 7.7% (6.0) 182 114 breast cancer 0% 2.44% 5.29% (3.5) 3.82% (3.7) 58 30 sonar 0% 0% 32.5% (19.8) 29.8% (18.8) 234 199 Pima Indians 10.9% 16.0% 25.3% (5.3) 23.3% (6.8) 175 91
Table 1: Errors rates and number of base classifiers after 1000 iterations.
Since the Laplacian penalty can be computed without knowing the labels, the algorithm can also be used for semi-supervised learning. Figure 3(f) shows the results when only a subset of the training points are labeled. In this case, REGBOOST can use the combined data set to calculate the penalty, whereas both algorithms can use only the labeled points
to determine the base errors. Figure 3(f) indicates that REGBOOST has a clear advantage here. REGBOOST is also far better than the semi-supervised algorithm proposed in [12] (their best test error using the same settings is 18%).