{"title": "Scalable Semi-Supervised Aggregation of Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 1351, "page_last": 1359, "abstract": "We present and empirically evaluate an efficient algorithm that learns to aggregate the predictions of an ensemble of binary classifiers. The algorithm uses the structure of the ensemble predictions on unlabeled data to yield significant performance improvements. It does this without making assumptions on the structure or origin of the ensemble, without parameters, and as scalably as linear learning. We empirically demonstrate these performance gains with random forests.", "full_text": "Scalable Semi-Supervised Aggregation of Classi\ufb01ers\n\nAkshay Balsubramani\n\nUC San Diego\n\nabalsubr@cs.ucsd.edu\n\nYoav Freund\nUC San Diego\n\nyfreund@cs.ucsd.edu\n\nAbstract\n\nWe present and empirically evaluate an ef\ufb01cient algorithm that learns to aggre-\ngate the predictions of an ensemble of binary classi\ufb01ers. The algorithm uses the\nstructure of the ensemble predictions on unlabeled data to yield signi\ufb01cant perfor-\nmance improvements. It does this without making assumptions on the structure or\norigin of the ensemble, without parameters, and as scalably as linear learning. We\nempirically demonstrate these performance gains with random forests.\n\n1\n\nIntroduction\n\nEnsemble-based learning is a very successful approach to learning classi\ufb01ers, including well-known\nmethods like boosting [1], bagging [2], and random forests [3]. The power of these methods has\nbeen clearly demonstrated in open large-scale learning competitions such as the Net\ufb02ix Prize [4]\nand the ImageNet Challenge [5]. In general, these methods train a large number of \u201cbase\u201d classi\ufb01ers\nand then combine them using a (possibly weighted) majority vote. 
By aggregating over classi\ufb01ers,\nensemble methods reduce the variance of the predictions, and sometimes also reduce the bias [6].\nThe ensemble methods above rely solely on a labeled training set of data. In this paper we propose\nan ensemble method that uses a large unlabeled data set in addition to the labeled set. Our work is\ntherefore at the intersection of semi-supervised learning [7, 8] and ensemble learning.\nThis paper is based on recent theoretical results of the authors [9]. Our main contributions here are\nto extend and apply those results with a new algorithm in the context of random forests [3] and to\nperform experiments in which we show that, when the number of labeled examples is small, our\nalgorithm\u2019s performance is at least that of random forests, and often signi\ufb01cantly better.\nFor the sake of completeness, we provide an intuitive introduction to the analysis given in [9]. How\ncan unlabeled data help in the context of ensemble learning? Consider a simple example with six\nequiprobable data points. The ensemble consists of six classi\ufb01ers, partitioned into three \u201cA\u201d rules\nand three \u201cB\u201d rules. Suppose that the \u201cA\u201d rules each have error 1/3 and the \u201cB\u201d rules each have error\n1/6. 1 If given only this information, we might take the majority vote over the six rules, possibly\ngiving lower weights to the \u201cA\u201d rules because they have higher errors.\nSuppose, however, that we are given the unlabeled information in Table 1. The columns of this table\ncorrespond to the six classi\ufb01ers and the rows to the six unlabeled examples. Each entry corresponds\nto the prediction of the given classi\ufb01er on the given example. As we see, the main difference between\nthe \u201cA\u201d rules and the \u201cB\u201d rules is that any two \u201cA\u201d rules disagree with probability 1/3, whereas the\n\u201cB\u201d rules always agree. For this example, it can be seen (e.g. 
proved by contradiction) that the only\npossible true labeling of the unlabeled data that is consistent with Table 1 and with the errors of the\nclassifiers is that all the examples are labeled \u2019+\u2019.\nConsequently, we conclude that the majority vote over the \u201cA\u201d rules has zero error, performing\nsignificantly better than any of the base rules. In contrast, giving the \u201cB\u201d rules equal weight would\nresult in a rule with error 1/6. Crucially, our reasoning to this point has solely used the structure\nof the unlabeled examples along with the error rates in Table 1 to constrain our search for the true\nlabeling.\n\n1 We assume that (bounds on) the errors are, with high probability, true on the actual distribution. Such\nbounds can be derived using large deviation bounds or bootstrap-type methods.\n\n        A classifiers      B classifiers\nx1      +    -    +        +    +    +\nx2      +    -    +        +    +    +\nx3      +    +    -        +    +    +\nx4      +    +    -        +    +    +\nx5      -    +    +        +    +    +\nx6      -    +    +        -    -    -\nerror   1/3  1/3  1/3      1/6  1/6  1/6\n\nTable 1: An example of the utility of unlabeled examples in ensemble learning\n\nBy such reasoning alone, we have correctly predicted according to a weighted majority vote. This\nexample provides some insight into the ways in which unlabeled data can be useful:\n\n\u2022 When combining classifiers, diversity is important. It can be better to combine less accurate\nrules that disagree with each other than to combine more accurate rules that tend to agree.\n\u2022 The bounds on the errors of the rules can be seen as a set of constraints on the true labeling.\nA complementary set of constraints is provided by the unlabeled examples. These sets of\nconstraints can be combined to improve the accuracy of the ensemble classifier.\n\nThe above setup was recently introduced and analyzed in [9]. That paper characterizes the problem\nas a zero-sum game between a predictor and an adversary. 
It then describes the minimax solution of\nthe game, which corresponds to an ef\ufb01cient algorithm for transductive learning.\nIn this paper, we build on the worst-case framework of [9] to devise an ef\ufb01cient and practical semi-\nsupervised aggregation algorithm for random forests. To achieve this, we extend the framework to\nhandle specialists \u2013 classi\ufb01ers which only venture to predict on a subset of the data, and abstain\nfrom predicting on the rest. Specialists can be very useful in targeting regions of the data on which\nto precisely suggest a prediction.\nThe high-level idea of our algorithm is to arti\ufb01cially generate new specialists from the ensemble.\nWe incorporate these, and the targeted information they carry, into the worst-case framework of [9].\nThe resulting aggregated predictor inherits the advantages of the original framework:\n\n(A) Ef\ufb01cient: Learning reduces to solving a scalable p-dimensional convex optimization, and\n\ntest-time prediction is as ef\ufb01cient and parallelizable as p-dimensional linear prediction.\n\n(B) Versatile/robust: No assumptions about the structure or origin of the predictions or labels.\n(C) No introduced parameters: The aggregation method is completely data-dependent.\n(D) Safe: Accuracy guaranteed to be at least that of the best classi\ufb01er in the ensemble.\n\nWe develop these ideas in the rest of this paper, reviewing the core worst-case setting of [9] in Section\n2, and specifying how to incorporate specialists and the resulting learning algorithm in Section 3.\nThen we perform an exploratory evaluation of the framework on data in Section 4. Though the\nframework of [9] and our extensions can be applied to any ensemble of arbitrary origin, in this\npaper we focus on random forests, which have been repeatedly demonstrated to have state-of-the-\nart, robust classi\ufb01cation performance in a wide variety of situations [10]. 
We use a random forest\nas a base ensemble whose predictions we aggregate. But unlike conventional random forests, we\ndo not simply take a majority vote over tree predictions, instead using an unlabeled-data-dependent\naggregation strategy dictated by the worst-case framework we employ.\n\n2 Preliminaries\n\nA few definitions are required to discuss these issues concretely, following [9]. Write $[a]_+ = \\max(0, a)$\nand $[n] = \\{1, 2, \\dots, n\\}$. All vector inequalities are componentwise.\n\nWe first consider an ensemble $H = \\{h_1, \\dots, h_p\\}$ and unlabeled data $x_1, \\dots, x_n$ on which we wish\nto predict. As in [9], the predictions and labels are allowed to be randomized, represented by values\nin $[-1, 1]$ instead of just the two values $\\{-1, 1\\}$. The ensemble\u2019s predictions on the unlabeled data\nare denoted by F:\n\n$$F = \\begin{pmatrix} h_1(x_1) & h_1(x_2) & \\cdots & h_1(x_n) \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ h_p(x_1) & h_p(x_2) & \\cdots & h_p(x_n) \\end{pmatrix} \\in [-1, 1]^{p \\times n} \\qquad (1)$$\n\nWe use vector notation for the rows and columns of F: $h_i = (h_i(x_1), \\cdots, h_i(x_n))^\\top$ and\n$x_j = (h_1(x_j), \\cdots, h_p(x_j))^\\top$. The true labels on the test data T are represented by $z = (z_1; \\dots; z_n) \\in [-1, 1]^n$.\nThe labels z are hidden from the predictor, but we assume the predictor has knowledge of\na correlation vector $b \\in (0, 1]^p$ such that $\\frac{1}{n} \\sum_j h_i(x_j) z_j \\geq b_i$ for all $i \\in [p]$, i.e. $\\frac{1}{n} F z \\geq b$. These p constraints\non z exactly represent upper bounds on individual classifier error rates, which can be estimated from\nthe training set w.h.p. 
when all the data are drawn i.i.d., in a standard way also used by empirical\nrisk minimization (ERM) methods that simply predict with the minimum-error classifier [9].\n\n2.1 The Transductive Binary Classification Game\n\nThe idea of [9] is to formulate the ensemble aggregation problem as a two-player zero-sum game\nbetween a predictor and an adversary. In this game, the predictor is the first player, who plays\n$g = (g_1; g_2; \\dots; g_n)$, a randomized label $g_i \\in [-1, 1]$ for each example $\\{x_i\\}_{i=1}^n$. The adversary\nthen sets the labels $z \\in [-1, 1]^n$ under the ensemble classifier error constraints defined by b. 2 The\npredictor\u2019s goal is to minimize the worst-case expected classification error on the test data (w.r.t.\nthe randomized labelings z and g), which is just $\\frac{1}{2} \\left( 1 - \\frac{1}{n} z^\\top g \\right)$. This is equivalently viewed as\nmaximizing worst-case correlation $\\frac{1}{n} z^\\top g$. To summarize concretely, we study the following game:\n\n$$V := \\max_{g \\in [-1,1]^n} \\; \\min_{z \\in [-1,1]^n, \\; \\frac{1}{n} F z \\geq b} \\; \\frac{1}{n} z^\\top g \\qquad (2)$$\n\nThe minimax theorem ([1], p.144) applies to the game (2), and there is an optimal strategy $g^*$ such\nthat $\\min_{z \\in [-1,1]^n, \\; \\frac{1}{n} F z \\geq b} \\frac{1}{n} z^\\top g^* \\geq V$, guaranteeing worst-case prediction error $\\frac{1}{2} (1 - V)$ on the n unlabeled\ndata. This optimal strategy $g^*$ is a simple function of a particular weighting over the p hypotheses \u2013\na nonnegative p-vector.\nDefinition 1 (Slack Function). Let $\\sigma \\geq 0^p$ be a weight vector over H (not necessarily a distribution).\nThe vector of ensemble predictions is $F^\\top \\sigma = (x_1^\\top \\sigma, \\dots, x_n^\\top \\sigma)$, whose elements\u2019 magnitudes are\nthe margins. The prediction slack function is\n\n$$\\gamma(\\sigma, b) := \\gamma(\\sigma) := -b^\\top \\sigma + \\frac{1}{n} \\sum_{j=1}^n \\left[ \\left| x_j^\\top \\sigma \\right| - 1 \\right]_+ \\qquad (3)$$\n\nand this is convex in $\\sigma$. The optimal weight vector $\\sigma^*$ is any minimizer $\\sigma^* \\in \\arg\\min_{\\sigma \\geq 0^p} \\left[ \\gamma(\\sigma) \\right]$.\n\nThe main result of [9] uses these to describe the minimax equilibrium of the game (2).\nTheorem 2 ([9]). The minimax value of the game (2) is $V = -\\gamma(\\sigma^*)$. The minimax optimal\npredictions are defined as follows: for all $j \\in [n]$,\n\n$$g_j^* := g_j(\\sigma^*) = \\begin{cases} x_j^\\top \\sigma^* & \\text{if } \\left| x_j^\\top \\sigma^* \\right| < 1 \\\\ \\mathrm{sgn}(x_j^\\top \\sigma^*) & \\text{otherwise} \\end{cases}$$\n\n2 Since b is calculated from the training set and deviation bounds, we assume the problem feasible w.h.p.\n\n2.2 Interpretation\n\nTheorem 2 suggests a statistical learning algorithm for aggregating the p ensemble classifiers\u2019 predictions:\nestimate b from the training (labeled) set, optimize the convex slack function $\\gamma(\\sigma)$ to find\n$\\sigma^*$, and finally predict with $g_j(\\sigma^*)$ on each example j in the test set. The resulting predictions are\nguaranteed to have low error, as measured by V. In particular, it is easy to prove [9] that V is at least\n$\\max_i b_i$, the performance of the best classifier.\nThe slack function (3) merits further scrutiny. Its first term depends only on the labeled data and\nnot the unlabeled set, while the second term $\\frac{1}{n} \\sum_{j=1}^n \\left[ \\left| x_j^\\top \\sigma \\right| - 1 \\right]_+$ incorporates only unlabeled\ninformation. 
These two terms trade off smoothly \u2013 as the problem setting becomes fully supervised\nand unlabeled information is absent, the first term dominates, and $\\sigma^*$ tends to put all its weight on\nthe best single classifier like ERM.\nIndeed, this viewpoint suggests a (loose) interpretation of the second term as an unsupervised regularizer\nfor the otherwise fully supervised optimization of the \u201caverage\u201d error $b^\\top \\sigma$. It turns out that\na change in the regularization factor corresponds to different constraints on the true labels z:\n\nTheorem 3 ([9]). Let $V_\\alpha := \\max_{g \\in [-1,1]^n} \\; \\min_{z \\in [-\\alpha,\\alpha]^n, \\; \\frac{1}{n} F z \\geq b} \\; \\frac{1}{n} z^\\top g$ for any $\\alpha > 0$. Then\n\n$$V_\\alpha = -\\min_{\\sigma \\geq 0^p} \\left[ -b^\\top \\sigma + \\frac{\\alpha}{n} \\sum_{j=1}^n \\left[ \\left| x_j^\\top \\sigma \\right| - 1 \\right]_+ \\right].$$\n\nSo the regularized optimization assumes each $z_i \\in [-\\alpha, \\alpha]$. For $\\alpha < 1$, this is equivalent to assuming\nthe usual binary labels ($\\alpha = 1$), and then adding uniform random label noise: flipping the label\nw.p. $\\frac{1}{2} (1 - \\alpha)$ on each of the n examples independently. This encourages \u201cclipping\u201d of the ensemble\npredictions $x_j^\\top \\sigma^*$ to the $\\sigma^*$-weighted majority vote predictions, as specified by $g^*$.\n\n2.3 Advantages and Disadvantages\n\nThis formulation has several significant merits that would seem to recommend its use in practical\nsituations. It is very efficient \u2013 once b is estimated (a scalable task, given the labeled set), the\nslack function $\\gamma$ is effectively an average over convex functions of i.i.d. 
unlabeled examples, and\nconsequently is amenable to standard convex optimization techniques [9] like stochastic gradient\ndescent (SGD) and variants. These only operate in p dimensions, independent of n (which is (cid:29) p).\nThe slack function is Lipschitz and well-behaved, resulting in stable approximate learning.\nMoreover, test-time prediction is extremely ef\ufb01cient, because it only requires the p-dimensional\nweighting \u03c3\u2217 and can be computed example-by-example on the test set using only a dot product\nin Rp. The form of g\u2217 and its dependence on \u03c3\u2217 facilitates interpretation as well, as it resembles\nfamiliar objects: sigmoid link functions for linear classi\ufb01ers.\nOther advantages of this method also bear mention: it makes no assumptions on the structure of H\nor F, is provably robust against the worst case, and adds no input parameters that need tuning. These\nbene\ufb01ts are notable because they will be inherited by our extension of the framework in this paper.\nHowever, this algorithm\u2019s practical performance can still be mediocre on real data, which is often\neasier to predict than an adversarial setup would have us believe. As a result, we seek to add more\ninformation in the form of constraints on the adversary, to narrow the gap between it and reality.\n\n3 Learning with Specialists\n\nTo address this issue, we examine a generalized scenario in which each classi\ufb01er in the ensemble\ncan abstain on any subset of the examples instead of predicting \u00b11. It is a specialist that predicts\nonly over a subset of the input, and we think of its abstain/participate decision being randomized in\nthe same way as the randomized label on each example. 
In this section, we extend the framework of\nSection 2.1 to arbitrary specialists, and discuss the semi-supervised learning algorithm that results.\nIn our formulation, suppose that for a classifier $i \\in [p]$ and an example x, the classifier decides\nto abstain with probability $1 - v_i(x)$. But if the decision is to participate, the classifier predicts\n$h_i(x) \\in [-1, 1]$ as previously. Our only assumption on $\\{v_i(x_1), \\dots, v_i(x_n)\\}$ is the reasonable one\nthat $\\sum_{j=1}^n v_i(x_j) > 0$, so classifier i is not a worthless specialist that abstains everywhere.\n\nThe constraint on classifier i is now not on its correlation with z on the entire test set, but on the\naverage correlation with z restricted to occasions on which it participates. So for some $[b_S]_i \\in [0, 1]$,\n\n$$\\sum_{j=1}^n \\left( \\frac{v_i(x_j)}{\\sum_{k=1}^n v_i(x_k)} \\right) h_i(x_j) z_j \\geq [b_S]_i \\qquad (4)$$\n\nDefine $\\rho_i(x_j) := \\frac{v_i(x_j)}{\\sum_{k=1}^n v_i(x_k)}$ (a distribution over $j \\in [n]$) for convenience. Now redefine our\nunlabeled data matrix as follows:\n\n$$S = n \\begin{pmatrix} \\rho_1(x_1) h_1(x_1) & \\rho_1(x_2) h_1(x_2) & \\cdots & \\rho_1(x_n) h_1(x_n) \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ \\rho_p(x_1) h_p(x_1) & \\rho_p(x_2) h_p(x_2) & \\cdots & \\rho_p(x_n) h_p(x_n) \\end{pmatrix} \\qquad (5)$$\n\nThen the constraints (4) can be written as $\\frac{1}{n} S z \\geq b_S$, analogous to the initial prediction game (2).\nTo summarize, our specialist ensemble aggregation game is stated as\n\n$$V_S := \\max_{g \\in [-1,1]^n} \\; \\min_{z \\in [-1,1]^n, \\; \\frac{1}{n} S z \\geq b_S} \\; \\frac{1}{n} z^\\top g \\qquad (6)$$\n\nWe can immediately solve this game from Thm. 2, with $(S, b_S)$ simply taking the place of $(F, b)$.\nTheorem 4 (Solution of the Specialist Aggregation Game). The awake ensemble prediction w.r.t.\nweighting $\\sigma \\geq 0^p$ on example $x_i$ is $\\left[ S^\\top \\sigma \\right]_i = n \\sum_{j=1}^p \\rho_j(x_i) h_j(x_i) \\sigma_j$. The slack function is now\n\n$$\\gamma_S(\\sigma) := \\frac{1}{n} \\sum_{j=1}^n \\left[ \\left| \\left[ S^\\top \\sigma \\right]_j \\right| - 1 \\right]_+ - b_S^\\top \\sigma \\qquad (7)$$\n\nThe minimax value of this game is $V_S = \\max_{\\sigma \\geq 0^p} [-\\gamma_S(\\sigma)] = -\\gamma_S(\\sigma_S^*)$. The minimax optimal\npredictions are defined as follows: for all $i \\in [n]$,\n\n$$[g_S^*]_i := g_S(\\sigma_S^*) = \\begin{cases} \\left[ S^\\top \\sigma_S^* \\right]_i & \\text{if } \\left| \\left[ S^\\top \\sigma_S^* \\right]_i \\right| < 1 \\\\ \\mathrm{sgn}(\\left[ S^\\top \\sigma_S^* \\right]_i) & \\text{otherwise} \\end{cases}$$\n\nIn the no-specialists case, the vector $\\rho_i$ is the uniform distribution $(\\frac{1}{n}, \\dots, \\frac{1}{n})$ for any $i \\in [p]$, and\nthe problem reduces to the prediction game (2). As in the original prediction game, the minimax\nequilibrium depends on the data only through the ensemble predictions, but these are now of a\ndifferent form. Each example is now weighted proportionally to $\\rho_j(x_i)$. So on any given example\n$x_i$, only hypotheses which participate on it will be counted; and those that specialize the most\nnarrowly, and participate on the fewest other examples, will have more influence on the eventual\nprediction $g_i$, ceteris paribus.\n\n3.1 Creating Specialists for an Algorithm\n\nWe can now present the main ensemble aggregation method of this paper, which creates specialists\nfrom the ensemble, adding them as additional constraints (rows of S). The algorithm,\nHEDGECLIPPER, is given in Fig. 1, and instantiates our specialist learning framework with a random\nforest [3]. 
As an initial exploration of the framework here, random forests are an appropriate\nbase ensemble because they are known to exhibit state-of-the-art performance [10]. Their well-known\nadvantages also include scalability, robustness (to corrupt data and parameter choices), and\ninterpretability; each of these benefits is inherited by our aggregation algorithm.\nFurthermore, decision trees are a natural fit as the ensemble classifiers because they are inherently\nhierarchical. Intuitively (and indeed formally too [11]), they act like nearest-neighbor (NN) predictors\nw.r.t. a distance that is \u201cadaptive\u201d to the data. So each tree in a random forest represents a\nsomewhat different, nonparametric partition of the data space into regions in which one of the labels\n\u00b11 dominates. Each such region corresponds exactly to a leaf of the tree.\nThe idea of HEDGECLIPPER is simply to consider each leaf in the forest as a specialist, which\npredicts only on the data falling into it. By the NN intuition above, these specialists can be viewed\nas predicting on data that is near them, where the supervised training of the tree attempts to define\nthe purest possible partitioning of the space. A pure partitioning results in many specialists with\n$[b_S]_i \\approx 1$, each of which contributes to the awake ensemble prediction w.r.t. $\\sigma^*$ over its domain, to\ninfluence it towards the correct label (inasmuch as $[b_S]_i$ is high).\nThough the idea is complex in concept for a large forest with many arbitrarily overlapping leaves\nfrom different trees, it fits the worst-case specialist framework of the previous sections. So the\nalgorithm is still essentially linear learning with convex optimization, as we have described.\n\nAlgorithm 1 HEDGECLIPPER\nInput: Labeled set L, unlabeled set U\n1: Using L, grow trees $T = \\{T_1, \\dots, T_p\\}$ (regularized; see Sec. 3.2)\n2: Using L, estimate $b_S$ on T and its leaves\n3: Using U, (approximately) optimize (7) to estimate $\\sigma_S^*$\nOutput: The estimated weighting $\\sigma_S^*$, for use at test time\n\nFigure 1: At left is algorithm HEDGECLIPPER. At right is a schematic of how the forest structure is related\nto the unlabeled data matrix S, with a given example x highlighted. The two colors in the matrix represent \u00b11\npredictions, and white cells abstentions.\n\n3.2 Discussion\n\nTrees in random forests have thousands of leaves or more in practice. As we are advocating adding\nso many extra specialists to the ensemble for the optimization, it is natural to ask whether this erodes\nsome of the advantages we have claimed earlier.\nComputationally, it does not. When $\\rho_j(x_i) = 0$, i.e. classifier j abstains deterministically on $x_i$,\nthen the value of $h_j(x_i)$ is irrelevant. So storing S in a sparse matrix format is natural in our setup,\nwith the accompanying performance gain in computing $S^\\top \\sigma$ while learning $\\sigma^*$ and predicting with\nit. This turns out to be crucial to efficiency \u2013 each tree induces a partitioning of the data, so the set\nof rows corresponding to any tree contains n nonzero entries in total. This is seen in Fig. 1.\nStatistically, the situation is more complex. On one hand, there is no danger of overfitting in the\ntraditional sense, regardless of how many specialists are added. Each additional specialist can only\nshrink the constraint set that the adversary must follow in the game (6). It only adds information\nabout z, and therefore the performance $V_S$ cannot get worse, if the game is solved exactly.\nHowever, for learning we are only concerned with approximately optimizing $\\gamma_S(\\sigma)$ and solving the\ngame. This presents several statistical challenges. Standard optimization methods do not converge\nas well in high ambient dimension, even given the structure of our problem. 
In addition, random\nforests practically perform best when each tree is grown to over\ufb01t. In our case, on any sizable test\nset, small leaves would cause some entries of S to have large magnitude, (cid:29) 1. This can foil an\nalgorithm like HEDGECLIPPER by causing it to vary wildly during the optimization, particularly\nsince those leaves\u2019 [bS]i values are only roughly estimated.\nFrom an optimization perspective, some of these issues can be addressed by e.g. (pseudo)-second-\norder methods [12], whose effect would be interesting to explore in future work. Our implementation\nopts for another approach \u2013 to grow trees constrained to have a nontrivial minimum weight per leaf.\nOf course, there are many other ways to handle this, including using the tree structure beyond the\nleaves; we just aim to conduct an exploratory evaluation here, as several of these areas remain ripe\nfor future research.\n\n6\n\n\f4 Experimental Evaluation\n\nWe now turn to evaluating HEDGECLIPPER on publicly available datasets. Our implementation uses\nminibatch SGD to optimize (6), runs in Python on top of the popular open-source learning package\nscikit-learn, and runs out-of-core (n-independent memory), taking advantage of the scalabil-\nity of our formulation. 3 The datasets are drawn from UCI/LibSVM as well as data mining sites\nlike Kaggle, and no further preprocessing was done on the data. We refer to \u201cBase RF\u201d as the forest\nof constrained trees from which our implementation draws its specialists. We restrict the train-\ning data available to the algorithm, using mostly supervised datasets because these far outnumber\nmedium/large-scale public semi-supervised datasets. Unused labeled examples are combined with\nthe test examples (and the extra unlabeled set, if any is provided) to form the set of unlabeled data\nused by the algorithm. 
Further information and discussion on the protocol is in the appendix.\nClass-imbalanced and noisy sets are included to demonstrate the aforementioned practical advan-\ntages of HEDGECLIPPER. Therefore, AUC is an appropriate measure of performance, and these\nresults are in Table 2. Results are averaged over 10 runs, each drawing a different random subsam-\nple of labeled data. The best results according to a paired t-test are in bold.\nWe \ufb01nd that the use of unlabeled data is suf\ufb01cient to achieve improvements over even traditionally\nover\ufb01tted RFs in many cases. Notably, in most cases there is a signi\ufb01cant bene\ufb01t given by unlabeled\ndata in our formulation, as compared to the base RF used. The boosting-type methods also perform\nfairly well, as we discuss in the next section.\n\nFigure 2: Class-conditional \u201cawake ensemble prediction\u201d (x(cid:62)\u03c3\u2217) distributions, on SUSY. Rows (top to bot-\ntom): {1K, 10K, 100K} labels. Columns (left to right): \u03b1 = {1.0, 0.3, 3.0}, and the base RF.\n\nThe awake ensemble prediction values x(cid:62)\u03c3 on the unlabeled set are a natural way to visualize and\nexplore the operation of the algorithm on the data, in an analogous way to the margin distribution in\nboosting [6]. One representative sample is in Fig. 2, on SUSY, a dataset with many (5M) examples,\nroughly evenly split between \u00b11. These plots demonstrate that our algorithm produces much more\npeaked class-conditional ensemble prediction distributions than random forests, suggesting margin-\nbased learning applications. Changing \u03b1 alters the aggressiveness of the clipping, inducing a more\nor less peaked distribution. The other datasets without dramatic label imbalance show very similar\nqualitative behavior in these respects, and these plots help choose \u03b1 in practice (see appendix).\nToy datasets with extremely low dimension seem to exhibit little to no signi\ufb01cant improvement\nfrom our method. 
We believe this is because the distinct feature splits found by the random forest\nare few in number, and it is the diversity in ensemble predictions that enables HEDGECLIPPER to\nclip (weighted majority vote) dramatically and achieve its performance gains.\nOn the other hand, given a large quantity of data, our algorithm is able to learn significant structure, and\nthe minimax structure appears appreciably close to reality, as evinced by the results on large datasets.\n\n3 It is possible to make this footprint independent of d as well by hashing features [13], not done here.\n\nDataset | # training | HEDGECLIPPER | Random Forest | Base RF | AdaBoost Trees | MART [14] | Logistic Regression\nkagg-prot | 10 | 0.567 | 0.509 | 0.500 | 0.520 | 0.497 | 0.510\nkagg-prot | 100 | 0.714 | 0.665 | 0.656 | 0.681 | 0.666 | 0.688\nssl-text | 10 | 0.586 | 0.517 | 0.512 | 0.556 | 0.553 | 0.501\nssl-text | 100 | 0.765 | 0.551 | 0.542 | 0.596 | 0.569 | 0.552\nkagg-cred | 100 | 0.558 | 0.518 | 0.501 | 0.528 | 0.542 | 0.502\nkagg-cred | 1K | 0.602 | 0.534 | 0.510 | 0.585 | 0.565 | 0.505\nkagg-cred | 10K | 0.603 | 0.563 | 0.535 | 0.586 | 0.566 | 0.510\na1a | 100 | 0.779 | 0.619 | 0.525 | 0.680 | 0.682 | 0.725\na1a | 1K | 0.808 | 0.714 | 0.655 | 0.734 | 0.722 | 0.768\nw1a | 100 | 0.543 | 0.510 | 0.505 | 0.502 | 0.513 | 0.509\nw1a | 1K | 0.651 | 0.592 | 0.520 | 0.695 | 0.689 | 0.671\ncovtype | 100 | 0.735 | 0.703 | 0.661 | 0.709 | 0.732 | 0.515\ncovtype | 1K | 0.764 | 0.761 | 0.715 | 0.730 | 0.761 | 0.524\ncovtype | 10K | 0.809 | 0.822 | 0.785 | 0.759 | 0.788 | 0.515\nssl-secstr | 10 | 0.572 | 0.574 | 0.535 | 0.563 | 0.557 | 0.557\nssl-secstr | 100 | 0.656 | 0.645 | 0.610 | 0.643 | 0.637 | 0.629\nssl-secstr | 1K | 0.687 | 0.682 | 0.646 | 0.690 | 0.689 | 0.683\nSUSY | 1K | 0.776 | 0.769 | 0.764 | 0.747 | 0.771 | 0.720\nSUSY | 10K | 0.785 | 0.787 | 0.784 | 0.787 | 0.789 | 0.759\nSUSY | 100K | 0.799 | 0.797 | 0.797 | 0.797 | 0.796 | 0.779\nepsilon | 1K | 0.651 | 0.659 | 0.645 | 0.718 | 0.726 | 0.774\nwebspam-uni | 1K | 0.936 | 0.928 | 0.920 | 0.923 | 0.928 | 0.840\nwebspam-uni | 10K | 0.967 | 0.970 | 0.957 | 0.945 | 0.953 | 0.901\n\nTable 2: Area under ROC curve for HEDGECLIPPER vs. supervised ensemble algorithms.\n\n5 Related and Future Work\n\nThis paper\u2019s framework and algorithms are superficially reminiscent of boosting, another paradigm\nthat uses voting behavior to aggregate an ensemble and has a game-theoretic intuition [1, 15]. There\nis some work on semi-supervised versions of boosting [16], but it departs from this principled structure\nand has little in common with our approach. Classical boosting algorithms like AdaBoost [17]\nmake no attempt to use unlabeled data. It is an interesting open problem to incorporate boosting\nideas into our formulation, particularly since the two boosting-type methods acquit themselves well\nin our results, and can pack information parsimoniously into many fewer ensemble classifiers than\nrandom forests.\nThere is a long-recognized connection between transductive and semi-supervised learning, and our\nmethod bridges these two settings. Popular variants on supervised learning such as the transductive\nSVM [18] and graph-based or nearest-neighbor algorithms, which dominate the semi-supervised\nliterature [8], have shown promise largely in data-poor regimes because they face major scalability\nchallenges. Our focus on ensemble aggregation instead allows us to keep a computationally inexpensive\nlinear formulation and avoid considering the underlying feature space of the data. 
Largely\nunsupervised ensemble methods have been explored especially in applications like crowdsourcing,\nin which the method of [19] gave rise to a plethora of Bayesian methods under various conditional\nindependence generative assumptions on F [20]. Using tree structure to construct new features has\nbeen applied successfully, though without guarantees [21].\nLearning with specialists has been studied in an adversarial online setting as in the work of Freund\net al. [22]. Though that paper\u2019s setting and focus is different from ours, the optimal algorithms it\nderives also depend on each specialist\u2019s average error on the examples on which it is awake.\nFinally, we re-emphasize the generality of our formulation, which leaves many interesting questions\nto be explored. The specialists we form are not restricted to being trees; there are other ways of\ndividing the data like clustering methods. Indeed, the ensemble can be heterogeneous and even\nincorporate other semi-supervised methods. Our method is complementary to myriad classi\ufb01cation\nalgorithms, and we hope to stimulate inquiry into the many research avenues this opens.\n\nAcknowledgements\n\nThe authors acknowledge support from the National Science Foundation under grant IIS-1162581.\n\n8\n\n\fReferences\n[1] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.\n[2] Leo Breiman. Bagging predictors. Machine learning, 24(2):123\u2013140, 1996.\n[3] Leo Breiman. Random forests. Machine learning, 45(1):5\u201332, 2001.\n[4] Yehuda Koren. The bellkor solution to the net\ufb02ix grand prize. 2009.\n[5] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,\nAndrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large\nScale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.\n\n[6] Robert E Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. 
Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, pages 1651\u20131686, 1998.\n[7] Olivier Chapelle, Bernhard Sch\u00f6lkopf, and Alexander Zien. Semi-supervised learning.\n[8] Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1\u2013130, 2009.\n[9] Akshay Balsubramani and Yoav Freund. Optimally combining classifiers using unlabeled data. In Conference on Learning Theory, 2015.\n[10] Rich Caruana, Nikos Karampatziakis, and Ainur Yessenalina. An empirical evaluation of supervised learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 96\u2013103. ACM, 2008.\n[11] Yi Lin and Yongho Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578\u2013590, 2006.\n[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n[13] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113\u20131120. ACM, 2009.\n[14] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189\u20131232, 2001.\n[15] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325\u2013332. ACM, 1996.\n[16] P Kumar Mallapragada, Rong Jin, Anil K Jain, and Yi Liu. Semiboost: Boosting for semi-supervised learning. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2000\u20132014, 2009.\n[17] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119\u2013139, 1997.\n[18] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 200\u2013209. Morgan Kaufmann Publishers Inc., 1999.\n[19] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20\u201328, 1979.\n[20] Fabio Parisi, Francesco Strino, Boaz Nadler, and Yuval Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences, 111(4):1253\u20131258, 2014.\n[21] Frank Moosmann, Bill Triggs, and Frederic Jurie. Fast discriminative visual codebooks using randomized clustering forests. In Twentieth Annual Conference on Neural Information Processing Systems (NIPS\u201906), pages 985\u2013992. MIT Press, 2007.\n[22] Yoav Freund, Robert E Schapire, Yoram Singer, and Manfred K Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 334\u2013343. ACM, 1997.\n[23] Predicting a Biological Response. 2012. https://www.kaggle.com/c/bioresponse.\n[24] Give Me Some Credit. 2011. https://www.kaggle.com/c/GiveMeSomeCredit.\n", "award": [], "sourceid": 829, "authors": [{"given_name": "Akshay", "family_name": "Balsubramani", "institution": "Ucsd"}, {"given_name": "Yoav", "family_name": "Freund", "institution": "UC San Diego"}]}