{"title": "Mondrian Forests: Efficient Online Random Forests", "book": "Advances in Neural Information Processing Systems", "page_first": 3140, "page_last": 3148, "abstract": "Ensembles of randomized decision trees, usually referred to as random forests, are widely used for classification and regression tasks in machine learning and statistics. Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as Breiman's random forest and extremely randomized trees) operate on batches of training data. Online methods are now in greater demand. Existing online random forests, however, require more training data than their batch counterpart to achieve comparable predictive performance. In this work, we use Mondrian processes (Roy and Teh, 2009) to construct ensembles of random decision trees we call Mondrian forests. Mondrian forests can be grown in an incremental/online fashion and remarkably, the distribution of online Mondrian forests is the same as that of batch Mondrian forests. Mondrian forests achieve competitive predictive performance comparable with existing online random forests and periodically re-trained batch random forests, while being more than an order of magnitude faster, thus representing a better computation vs accuracy tradeoff.", "full_text": "Mondrian Forests: Ef\ufb01cient Online Random Forests\n\nBalaji Lakshminarayanan\n\nGatsby Unit\n\nUniversity College London\n\nDaniel M. Roy\n\nDepartment of Engineering\nUniversity of Cambridge\n\nYee Whye Teh\n\nDepartment of Statistics\nUniversity of Oxford\n\nAbstract\n\nEnsembles of randomized decision trees, usually referred to as random forests,\nare widely used for classi\ufb01cation and regression tasks in machine learning and\nstatistics. 
Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as Breiman\u2019s random forest and extremely randomized trees) operate on batches of training data. Online methods are now in greater demand. Existing online random forests, however, require more training data than their batch counterpart to achieve comparable predictive performance. In this work, we use Mondrian processes (Roy and Teh, 2009) to construct ensembles of random decision trees we call Mondrian forests. Mondrian forests can be grown in an incremental/online fashion and remarkably, the distribution of online Mondrian forests is the same as that of batch Mondrian forests. Mondrian forests achieve competitive predictive performance comparable with existing online random forests and periodically re-trained batch random forests, while being more than an order of magnitude faster, thus representing a better computation vs accuracy tradeoff.\n\n1 Introduction\n\nDespite being introduced over a decade ago, random forests remain one of the most popular machine learning tools due in part to their accuracy, scalability, and robustness in real-world classification tasks [3]. (We refer to [6] for an excellent survey of random forests.) In this paper, we introduce a novel class of random forests, called Mondrian forests (MF) because the underlying tree structure of each classifier in the ensemble is a so-called Mondrian process. Using the properties of Mondrian processes, we present an efficient online algorithm that agrees with its batch counterpart at each iteration.
Not only are online Mondrian forests faster and more accurate than recent proposals for online random forest methods, but they nearly match the accuracy of state-of-the-art batch random forest methods trained on the same dataset.\n\nThe paper is organized as follows: In Section 2, we describe our approach at a high-level, and in Sections 3, 4, and 5, we describe the tree structures, label model, and incremental updates/predictions in more detail. We discuss related work in Section 6, demonstrate the excellent empirical performance of MF in Section 7, and conclude in Section 8 with a discussion about future work.\n\n2 Approach\n\nGiven N labeled examples (x_1, y_1), . . . , (x_N, y_N) \u2208 R^D \u00d7 Y as training data, our task is to predict labels y \u2208 Y for unlabeled test points x \u2208 R^D. We will focus on multi-class classification where Y := {1, . . . , K}; however, it is possible to extend the methodology to other supervised learning tasks such as regression. Let X_{1:n} := (x_1, . . . , x_n), Y_{1:n} := (y_1, . . . , y_n), and D_{1:n} := (X_{1:n}, Y_{1:n}).\n\nA Mondrian forest classifier is constructed much like a random forest: Given training data D_{1:N}, we sample an independent collection T_1, . . . , T_M of so-called Mondrian trees, which we will describe in the next section. The prediction made by each Mondrian tree T_m is a distribution p_{T_m}(y | x, D_{1:N}) over the class label y for a test point x. The prediction made by the Mondrian forest is the average (1/M) \u2211_{m=1}^{M} p_{T_m}(y | x, D_{1:N}) of the individual tree predictions. As M \u2192 \u221e, the average converges at the standard rate to the expectation E_{T \u223c MT(\u03bb, D_{1:N})}[p_T(y | x, D_{1:N})], where MT(\u03bb, D_{1:N}) is the distribution of a Mondrian tree. As the limiting expectation does not depend on M, we would not expect to see overfitting behavior as M increases. A similar observation was made by Breiman in his seminal article [2] introducing random forests.
Note that the averaging procedure above is ensemble model combination and not Bayesian model averaging.\n\nIn the online learning setting, the training examples are presented one after another in a sequence of trials. Mondrian forests excel in this setting: at iteration N + 1, each Mondrian tree T \u223c MT(\u03bb, D_{1:N}) is updated to incorporate the next labeled example (x_{N+1}, y_{N+1}) by sampling an extended tree T\u2032 from a distribution MTx(\u03bb, T, D_{N+1}). Using properties of the Mondrian process, we can choose a probability distribution MTx such that T\u2032 = T on D_{1:N} and T\u2032 is distributed according to MT(\u03bb, D_{1:N+1}), i.e.,\n\nT \u223c MT(\u03bb, D_{1:N}) and T\u2032 | T, D_{1:N+1} \u223c MTx(\u03bb, T, D_{N+1}) implies T\u2032 \u223c MT(\u03bb, D_{1:N+1}). (1)\n\nTherefore, the distribution of Mondrian trees trained on a dataset in an incremental fashion is the same as that of Mondrian trees trained on the same dataset in a batch fashion, irrespective of the order in which the data points are observed. To the best of our knowledge, none of the existing online random forests have this property. Moreover, we can sample from MTx(\u03bb, T, D_{N+1}) efficiently: the complexity scales with the depth of the tree, which is typically logarithmic in N.\n\nWhile treating the online setting as a sequence of larger and larger batch problems is normally computationally prohibitive, this approach can be achieved efficiently with Mondrian forests. In the following sections, we define the Mondrian tree distribution MT(\u03bb, D_{1:N}), the label distribution p_T(y | x, D_{1:N}), and the update distribution MTx(\u03bb, T, D_{N+1}).\n\n3 Mondrian trees\n\nFor our purposes, a decision tree on R^D will be a hierarchical, binary partitioning of R^D and a rule for predicting the label of test points given training data.
More carefully, a rooted, strictly-binary tree is a finite tree T such that every node in T is either a leaf or internal node, and every node is the child of exactly one parent node, except for a distinguished root node, represented by \u03f5, which has no parent. Let leaves(T) denote the set of leaf nodes in T. For every internal node j \u2208 T \\ leaves(T), there are exactly two children nodes, represented by left(j) and right(j). To each node j \u2208 T, we associate a block B_j \u2286 R^D of the input space as follows: We let B_\u03f5 := R^D. Each internal node j \u2208 T \\ leaves(T) is associated with a split (\u03b4_j, \u03be_j), where \u03b4_j \u2208 {1, 2, . . . , D} denotes the dimension of the split and \u03be_j denotes the location of the split along dimension \u03b4_j. We then define\n\nB_{left(j)} := {x \u2208 B_j : x_{\u03b4_j} \u2264 \u03be_j} and B_{right(j)} := {x \u2208 B_j : x_{\u03b4_j} > \u03be_j}. (2)\n\nWe may write B_j = (\u2113_{j1}, u_{j1}] \u00d7 . . . \u00d7 (\u2113_{jD}, u_{jD}], where \u2113_{jd} and u_{jd} denote the lower and upper bounds, respectively, of the rectangular block B_j along dimension d. Put \u2113_j = {\u2113_{j1}, \u2113_{j2}, . . . , \u2113_{jD}} and u_j = {u_{j1}, u_{j2}, . . . , u_{jD}}. The decision tree structure is represented by the tuple T = (T, \u03b4, \u03be). We refer to Figure 1(a) for a simple illustration of a decision tree.\n\nIt will be useful to introduce some additional notation. Let parent(j) denote the parent of node j. Let N(j) denote the indices of training data points at node j, i.e., N(j) = {n \u2208 {1, . . . , N} : x_n \u2208 B_j}. Let D_{N(j)} = {X_{N(j)}, Y_{N(j)}} denote the features and labels of training data points at node j. Let \u2113^x_{jd} and u^x_{jd} denote the lower and upper bounds of training data points (hence the superscript x) respectively in node j along dimension d. Let B^x_j = (\u2113^x_{j1}, u^x_{j1}] \u00d7 . . . \u00d7 (\u2113^x_{jD}, u^x_{jD}] \u2286 B_j denote the smallest rectangle that encloses the training data points in node j.
\n\n[Figure 1 (image not recoverable): panel (a) Decision Tree, panel (b) Mondrian Tree; both panels show the splits x1 > 0.37 and x2 > 0.5]\n\nFigure 1: Example of a decision tree in [0, 1]^2 where x1 and x2 denote horizontal and vertical axis respectively: Figure 1(a) shows tree structure and partition of a decision tree, while Figure 1(b) shows a Mondrian tree. Note that the Mondrian tree is embedded on a vertical time axis, with each node associated with a time of split and the splits are committed only within the range of the training data in each block (denoted by gray rectangles). Let j denote the left child of the root: B_j = (0, 0.37] \u00d7 (0, 1] denotes the block associated with red circles and B^x_j \u2286 B_j is the smallest rectangle enclosing the two data points.\n\n3.1 Mondrian process distribution over decision trees\n\nMondrian processes, introduced by Roy and Teh [19], are families {M_t : t \u2208 [0, \u221e)} of random, hierarchical binary partitions of R^D such that M_t is a refinement of M_s whenever t > s.^1 Mondrian processes are natural candidates for the partition structure of random decision trees, but Mondrian processes on R^D are, in general, infinite structures that we cannot represent all at once. Because we only care about the partition on a finite set of observed data, we introduce Mondrian trees, which are restrictions of Mondrian processes to a finite set of points.\n\n^1 Roy and Teh [19] studied the distribution of {M_t : t \u2264 \u03bb} and referred to \u03bb as the budget. See [18, Chp. 5] for more details. We will refer to t as time, not to be confused with discrete time in the online learning setting.\n\n
A Mondrian tree T can be represented by a tuple (T, \u03b4, \u03be, \u03c4), where (T, \u03b4, \u03be) is a decision tree and \u03c4 = {\u03c4_j}_{j \u2208 T} associates a time of split \u03c4_j \u2265 0 with each node j. Split times increase with depth, i.e., \u03c4_j > \u03c4_{parent(j)}. We abuse notation and define \u03c4_{parent(\u03f5)} = 0.\n\nGiven a non-negative lifetime parameter \u03bb and training data D_{1:n}, the generative process for sampling Mondrian trees from MT(\u03bb, D_{1:n}) is described in the following two algorithms:\n\nAlgorithm 1 SampleMondrianTree(\u03bb, D_{1:n})\n  Initialize: T = \u2205, leaves(T) = \u2205, \u03b4 = \u2205, \u03be = \u2205, \u03c4 = \u2205, N(\u03f5) = {1, 2, . . . , n}\n  SampleMondrianBlock(\u03f5, D_{N(\u03f5)}, \u03bb)   \u25b7 Algorithm 2\n\nAlgorithm 2 SampleMondrianBlock(j, D_{N(j)}, \u03bb)\n  Add j to T\n  For all d, set \u2113^x_{jd} = min(X_{N(j),d}), u^x_{jd} = max(X_{N(j),d})   \u25b7 dimension-wise min and max\n  Sample E from exponential distribution with rate \u2211_d (u^x_{jd} \u2212 \u2113^x_{jd})\n  if \u03c4_{parent(j)} + E < \u03bb then   \u25b7 j is an internal node\n    Set \u03c4_j = \u03c4_{parent(j)} + E\n    Sample split dimension \u03b4_j, choosing d with probability proportional to u^x_{jd} \u2212 \u2113^x_{jd}\n    Sample split location \u03be_j uniformly from interval [\u2113^x_{j\u03b4_j}, u^x_{j\u03b4_j}]\n    Set N(left(j)) = {n \u2208 N(j) : X_{n,\u03b4_j} \u2264 \u03be_j} and N(right(j)) = {n \u2208 N(j) : X_{n,\u03b4_j} > \u03be_j}\n    SampleMondrianBlock(left(j), D_{N(left(j))}, \u03bb)\n    SampleMondrianBlock(right(j), D_{N(right(j))}, \u03bb)\n  else   \u25b7 j is a leaf node\n    Set \u03c4_j = \u03bb and add j to leaves(T)\n\nThe procedure starts with the root node \u03f5 and recurses down the tree. In Algorithm 2, we first compute \u2113^x_\u03f5 and u^x_\u03f5, i.e. the lower and upper bounds of B^x_\u03f5, the smallest rectangle enclosing X_{N(\u03f5)}. We sample E from an exponential distribution whose rate is the so-called linear dimension of B^x_\u03f5, given by \u2211_d (u^x_{\u03f5d} \u2212 \u2113^x_{\u03f5d}). Since \u03c4_{parent(\u03f5)} = 0, E + \u03c4_{parent(\u03f5)} = E.
If E \u2265 \u03bb, the time of split is not within the lifetime \u03bb; hence, we assign \u03f5 to be a leaf node and the procedure halts. (Since E[E] = 1/\u2211_d (u^x_{jd} \u2212 \u2113^x_{jd}), bigger rectangles are less likely to be leaf nodes.) Else, \u03f5 is an internal node and we sample a split (\u03b4_\u03f5, \u03be_\u03f5) from the uniform split distribution on B^x_\u03f5. More precisely, we first sample the dimension \u03b4_\u03f5, taking the value d with probability proportional to u^x_{\u03f5d} \u2212 \u2113^x_{\u03f5d}, and then sample the split location \u03be_\u03f5 uniformly from the interval [\u2113^x_{\u03f5\u03b4_\u03f5}, u^x_{\u03f5\u03b4_\u03f5}]. The procedure then recurses along the left and right children.\n\nMondrian trees differ from standard decision trees (e.g. CART, C4.5) in the following ways: (i) the splits are sampled independent of the labels Y_{N(j)}; (ii) every node j is associated with a split time denoted by \u03c4_j; (iii) the lifetime parameter \u03bb controls the total number of splits (similar to the maximum depth parameter for standard decision trees); (iv) the split represented by an internal node j holds only within B^x_j and not the whole of B_j. No commitment is made in B_j \\ B^x_j. Figure 1 illustrates the difference between decision trees and Mondrian trees.\n\nConsider the family of distributions MT(\u03bb, F), where F ranges over all possible finite sets of data points. Due to the fact that these distributions are derived from that of a Mondrian process on R^D restricted to a set F of points, the family MT(\u03bb, \u00b7) will be projective. Intuitively, projectivity implies that the tree distributions possess a type of self-consistency. In words, if we sample a Mondrian tree T from MT(\u03bb, F) and then restrict the tree T to a subset F\u2032 \u2286 F of points, then the restricted tree T\u2032 has distribution MT(\u03bb, F\u2032). Most importantly, projectivity gives us a consistent way to extend a Mondrian tree on a data set D_{1:N} to a larger data set D_{1:N+1}.
We exploit this property to incrementally grow a Mondrian tree: we instantiate the Mondrian tree on the observed training data points; upon observing a new data point D_{N+1}, we extend the Mondrian tree by sampling from the conditional distribution of a Mondrian tree on D_{1:N+1} given its restriction to D_{1:N}, denoted by MTx(\u03bb, T, D_{N+1}) in (1). Thus, a Mondrian process on R^D is represented only where we have observed training data.\n\n4 Label distribution: model, hierarchical prior, and predictive posterior\n\nSo far, our discussion has been focused on the tree structure. In this section, we focus on the predictive label distribution, p_T(y | x, D_{1:N}), for a tree T = (T, \u03b4, \u03be, \u03c4), dataset D_{1:N}, and test point x. Let leaf(x) denote the unique leaf node j \u2208 leaves(T) such that x \u2208 B_j. Intuitively, we want the predictive label distribution at x to be a smoothed version of the empirical distribution of labels for points in B_{leaf(x)} and in B_{j\u2032} for nearby nodes j\u2032. We achieve this smoothing via a hierarchical Bayesian approach: every node is associated with a label distribution, and a prior is chosen under which the label distribution of a node is similar to that of its parent. The predictive p_T(y | x, D_{1:N}) is then obtained via marginalization.\n\nAs is common in the decision tree literature, we assume the labels within each block are independent of X given the tree structure. For every j \u2208 T, let G_j denote the distribution of labels at node j, and let G = {G_j : j \u2208 T} be the set of label distributions at all the nodes in the tree. Given T and G, the predictive label distribution at x is p(y | x, T, G) = G_{leaf(x)}, i.e., the label distribution at the node leaf(x). In this paper, we focus on the case of categorical labels taking values in the set {1, . . . , K}, and so we abuse notation and write G_{j,k} for the probability that a point in B_j is labeled k.\n\nWe model the collection G_j, for j \u2208 T, as a hierarchy of normalized stable processes (NSP) [24].
An NSP prior is a distribution over distributions and is a special case of the Pitman-Yor process (PYP) prior where the concentration parameter is taken to zero [17].^2 The discount parameter d \u2208 (0, 1) controls the variation around the base distribution; if G_j \u223c NSP(d, H), then E[G_{jk}] = H_k and Var[G_{jk}] = (1 \u2212 d) H_k (1 \u2212 H_k). We use a hierarchical NSP (HNSP) prior over G_j as follows:\n\nG_\u03f5 | H \u223c NSP(d_\u03f5, H), and G_j | G_{parent(j)} \u223c NSP(d_j, G_{parent(j)}). (3)\n\nThis hierarchical prior was first proposed by Wood et al. [24]. Here we take the base distribution H to be the uniform distribution over the K labels, and set d_j = exp(\u2212\u03b3(\u03c4_j \u2212 \u03c4_{parent(j)})).\n\nGiven training data D_{1:N}, the predictive distribution p_T(y | x, D_{1:N}) is obtained by integrating over G, i.e., p_T(y | x, D_{1:N}) = E_{G \u223c p_T(G | D_{1:N})}[G_{leaf(x),y}] = G\u0304_{leaf(x),y}, where the posterior p_T(G | D_{1:N}) \u221d p_T(G) \u220f_{n=1}^{N} G_{leaf(x_n),y_n}. Posterior inference in the HNSP, i.e., computation of the posterior means G\u0304_{leaf(x)}, is a special case of posterior inference in the hierarchical PYP (HPYP). In particular, Teh [22] considers the HPYP with multinomial likelihood (in the context of language modeling). The model considered here is a special case of [22]. Exact inference is intractable and hence we resort to approximations. In particular, we use a fast approximation known as the interpolated Kneser-Ney (IKN) smoothing [22], a popular technique for smoothing probabilities in language modeling [13]. The IKN approximation in [22] can be extended in a straightforward fashion to the online setting, and the computational complexity of adding a new training instance is linear in the depth of the tree. We refer the reader to Appendix A for further details.\n\n^2 Taking the discount parameter to zero leads to a Dirichlet process.
Hierarchies of NSPs admit more tractable approximations than hierarchies of Dirichlet processes [24], hence our choice here.\n\n5 Online training and prediction\n\nIn this section, we describe the family of distributions MTx(\u03bb, T, D_{N+1}), which are used to incrementally add a data point, D_{N+1}, to a tree T. These updates are based on the conditional Mondrian algorithm [19], specialized to a finite set of points. In general, one or more of the following three operations may be executed while introducing a new data point: (i) introduction of a new split \u2018above\u2019 an existing split, (ii) extension of an existing split to the updated extent of the block and (iii) splitting an existing leaf node into two children. To the best of our knowledge, existing online decision trees use just the third operation, and the first two operations are unique to Mondrian trees. The complete pseudo-code for incrementally updating a Mondrian tree T with a new data point D according to MTx(\u03bb, T, D) is described in the following two algorithms. Figure 2 walks through the algorithms on a toy dataset.\n\nAlgorithm 3 ExtendMondrianTree(T, \u03bb, D)\n  Input: Tree T = (T, \u03b4, \u03be, \u03c4), new training instance D = (x, y)\n  ExtendMondrianBlock(T, \u03bb, \u03f5, D)   \u25b7 Algorithm 4\n\nAlgorithm 4 ExtendMondrianBlock(T, \u03bb, j, D)\n  Set e^\u2113 = max(\u2113^x_j \u2212 x, 0) and e^u = max(x \u2212 u^x_j, 0)   \u25b7 e^\u2113 = e^u = 0_D if x \u2208 B^x_j\n  Sample E from exponential distribution with rate \u2211_d (e^\u2113_d + e^u_d)\n  if \u03c4_{parent(j)} + E < \u03c4_j then   \u25b7 introduce new parent for node j\n    Sample split dimension \u03b4, choosing d with probability proportional to e^\u2113_d + e^u_d\n    Sample split location \u03be uniformly from interval [u^x_{j,\u03b4}, x_\u03b4] if x_\u03b4 > u^x_{j,\u03b4}, else [x_\u03b4, \u2113^x_{j,\u03b4}]\n    Insert a new node j\u0303 just above node j in the tree, and a new leaf j\u2033, sibling to j, where\n      \u03b4_{j\u0303} = \u03b4, \u03be_{j\u0303} = \u03be, \u03c4_{j\u0303} = \u03c4_{parent(j)} + E, \u2113^x_{j\u0303} = min(\u2113^x_j, x), u^x_{j\u0303} = max(u^x_j, x)\n      j\u2033 = left(j\u0303) iff x_{\u03b4_{j\u0303}} \u2264 \u03be_{j\u0303}\n    SampleMondrianBlock(j\u2033, D, \u03bb)\n  else\n    Update \u2113^x_j \u2190 min(\u2113^x_j, x), u^x_j \u2190 max(u^x_j, x)   \u25b7 update extent of node j\n    if j \u2209 leaves(T) then   \u25b7 return if j is a leaf node, else recurse down the tree\n      if x_{\u03b4_j} \u2264 \u03be_j then child(j) = left(j) else child(j) = right(j)\n      ExtendMondrianBlock(T, \u03bb, child(j), D)   \u25b7 recurse on child containing D\n\n
In practice, random forest implementations stop splitting a node when all the labels are identical and assign it to be a leaf node. To make our MF implementation comparable, we \u2018pause\u2019 a Mondrian block when all the labels are identical; if a new training instance lies within B_j of a paused leaf node j and has the same label as the rest of the data points in B_j, we continue pausing the Mondrian block. We \u2018un-pause\u2019 the Mondrian block when there is more than one unique label in that block. Algorithms 9 and 10 in the supplementary material discuss versions of SampleMondrianBlock and ExtendMondrianBlock for paused Mondrians.\n\nPrediction using Mondrian tree. Let x denote a test data point. If x is already \u2018contained\u2019 in the tree T, i.e., if x \u2208 B^x_j for some leaf j \u2208 leaves(T), then the prediction is taken to be G_{leaf(x)}. Otherwise, we somehow need to incorporate x.
One choice is to extend T by sampling T\u2032 from MTx(\u03bb, T, x) as described in Algorithm 3, and set the prediction to G_j, where j \u2208 leaves(T\u2032) is the leaf node containing x. A particular extension T\u2032 might lead to an overly confident prediction; hence, we average over every possible extension T\u2032. This integration can be carried out analytically and the computational complexity is linear in the depth of the tree. We refer to Appendix B for further details.\n\n6 Related work\n\nThe literature on random forests is vast and we do not attempt to cover it comprehensively; we provide a brief review here and refer to [6] and [8] for a recent review of random forests in batch and online settings respectively. Classic decision tree induction procedures choose the best split dimension and location from all candidate splits at each node by optimizing some suitable quality criterion (e.g. information gain) in a greedy manner. In a random forest, the individual trees are randomized to de-correlate their predictions. The most common strategies for injecting randomness are (i) bagging [1] and (ii) randomly subsampling the set of candidate splits within each node.\n\n[Figure 2 (image not recoverable): panels (a)-(f) show partitions with axes x1 and x2 and data points a, b, c, d; panels (g)-(i) show the corresponding trees with splits x1 > 0.75, x2 > 0.23 and x1 > 0.47 on a vertical time axis marked 0, 1.01, 2.42 and 3.97]\n\nFigure 2: Online learning with Mondrian trees on a toy dataset: We assume that \u03bb = \u221e, D = 2 and add one data point at each iteration. For simplicity, we ignore class labels and denote location of training data with red circles.
Figures 2(a), 2(c) and 2(f) show the partitions after the first, second and third iterations, respectively, with the intermediate figures denoting intermediate steps. Figures 2(g), 2(h) and 2(i) show the trees after the first, second and third iterations, along with a shared vertical time axis.\n\nAt iteration 1, we have two training data points, labeled as a, b. Figures 2(a) and 2(g) show the partition and tree structure of the Mondrian tree. Note that even though there is a split x2 > 0.23 at time t = 2.42, we commit this split only within B^x_j (shown by the gray rectangle).\n\nAt iteration 2, a new data point c is added. Algorithm 3 starts with the root node and recurses down the tree. Algorithm 4 checks if the new data point lies within B^x_\u03f5 by computing the additional extent e^\u2113 and e^u. In this case, c does not lie within B^x_\u03f5. Let R_ab and R_abc respectively denote the small gray rectangle (enclosing a, b) and big gray rectangle (enclosing a, b, c) in Figure 2(b). While extending the Mondrian from R_ab to R_abc, we could either introduce a new split in R_abc outside R_ab or extend the split in R_ab to the new range. To choose between these two options, we sample the time of this new split: we first sample E from an exponential distribution whose rate is the sum of the additional extent, i.e., \u2211_d (e^\u2113_d + e^u_d), and set the time of the new split to E + \u03c4_{parent(\u03f5)}. If E + \u03c4_{parent(\u03f5)} \u2264 \u03c4_\u03f5, this new split in R_abc can precede the old split in R_ab and a split is sampled in R_abc outside R_ab. In Figures 2(c) and 2(h), E + \u03c4_{parent(\u03f5)} = 1.01 + 0 \u2264 2.42, hence a new split x1 > 0.75 is introduced. The farther a new data point x is from B^x_j, the higher the rate \u2211_d (e^\u2113_d + e^u_d), and subsequently the higher the probability of a new split being introduced, since E[E] = 1/\u2211_d (e^\u2113_d + e^u_d). A new split in R_abc is sampled such that it is consistent with the existing partition structure in R_ab (i.e., the new split cannot slice through R_ab).\n\nIn the final iteration, we add data point d.
In Figure 2(d), the data point d lies within the extent of the root node, hence we traverse to the left side of the root and update B^x_j of the internal node containing {a, b} to include d. We could either introduce a new split or extend the split x2 > 0.23. In Figure 2(e), we extend the split x2 > 0.23 to the new extent, and traverse to the leaf node in Figure 2(h) containing b. In Figures 2(f) and 2(i), we sample E = 1.55 and since \u03c4_{parent(j)} + E = 2.42 + 1.55 = 3.97 \u2264 \u03bb = \u221e, we introduce a new split x1 > 0.47.\n\nTwo popular random forest variants in the batch setting are Breiman-RF [2] and Extremely randomized trees (ERT) [12]. Breiman-RF uses bagging and furthermore, at each node, a random k-dimensional subset of the original D features is sampled. ERT chooses a k dimensional subset of the features and then chooses one split location each for the k features randomly (unlike Breiman-RF which considers all possible split locations along a dimension). ERT does not use bagging. When k = 1, the ERT trees are totally randomized and the splits are chosen independent of the labels; hence the ERT-1 method is very similar to MF in the batch setting in terms of tree induction. (Note that unlike ERT, MF uses HNSP to smooth predictive estimates and allows a test point to branch off into its own node.) Perfect random trees (PERT), proposed by Cutler and Zhao [7] for classification problems, produce totally randomized trees similar to ERT-1, although there are some slight differences [12].\n\nExisting online random forests (ORF-Saffari [20] and ORF-Denil [8]) start with an empty tree and grow the tree incrementally. Every leaf of every tree maintains a list of k candidate splits and associated quality scores.
When a new data point is added, the scores of the candidate splits at the corresponding leaf node are updated. To reduce the risk of choosing a sub-optimal split based on noisy quality scores, additional hyper parameters such as the minimum number of data points at a leaf node before a decision is made and the minimum threshold for the quality criterion of the best split, are used to assess \u2018confidence\u2019 associated with a split. Once these criteria are satisfied at a leaf node, the best split is chosen (making this node an internal node) and its two children are the new leaf nodes (with their own candidate splits), and the process is repeated. These methods could be memory inefficient for deep trees due to the high cost associated with maintaining candidate quality scores for the fringe of potential children [8].\n\nThere has been some work on incremental induction of decision trees, e.g. incremental CART [5], ITI [23], VFDT [11] and dynamic trees [21], but to the best of our knowledge, these are focused on learning decision trees and have not been generalized to online random forests. We do not compare MF to incremental decision trees, since random forests are known to outperform single decision trees.\n\nBayesian models of decision trees [4, 9] typically specify a distribution over decision trees; such distributions usually depend on X and lack the projectivity property of the Mondrian process. More importantly, MF performs ensemble model combination and not Bayesian model averaging over decision trees. (See [10] for a discussion on the advantages of ensembles over single models, and [15] for a comparison of Bayesian model averaging and model combination.)\n\n7 Empirical evaluation\n\nThe purpose of these experiments is to evaluate the predictive performance (test accuracy) of MF as a function of (i) fraction of training data and (ii) training time.
We divide the training data into 100 mini-batches and we compare the performance of online random forests (MF, ORF-Saffari [20]) to batch random forests (Breiman-RF, ERT-k, ERT-1) which are trained on the same fraction of the training data. (We compare MF to dynamic trees as well; see Appendix F for more details.) Our scripts are implemented in Python. We implemented the ORF-Saffari algorithm as well as ERT in Python for timing comparisons. The scripts can be downloaded from the authors\u2019 webpages. We did not implement the ORF-Denil [8] algorithm since the predictive performance reported in [8] is very similar to that of ORF-Saffari and the computational complexity of the ORF-Denil algorithm is worse than that of ORF-Saffari. We used the Breiman-RF implementation in scikit-learn [16].^3\n\nWe evaluate on four of the five datasets used in [20]; we excluded the mushroom dataset as even very simple logical rules achieve > 99% accuracy on this dataset.^4 We re-scaled the datasets such that each feature takes on values in the range [0, 1] (by subtracting the min value along that dimension and dividing by the range along that dimension, where range = max \u2212 min).\n\nAs is common in the random forest literature [2], we set the number of trees M = 100. For Mondrian forests, we set the lifetime \u03bb = \u221e and the HNSP discount parameter \u03b3 = 10D. For ORF-Saffari, we set num epochs = 20 (number of passes through the training data) and set the other hyper parameters to the values used in [20]. For Breiman-RF and ERT, the hyper parameters are set to default values. We repeat each algorithm with five random initializations and report the mean performance. The results are shown in Figure 3.
(The * in Breiman-RF* indicates scikit-learn implementation.)\n\nComparing test accuracy vs fraction of training data on usps, satimages and letter datasets, we observe that MF achieves accuracy very close to the batch RF versions (Breiman-RF, ERT-k, ERT-1) trained on the same fraction of the data. MF significantly outperforms ORF-Saffari trained on the same fraction of training data. In batch RF versions, the same training data can be used to evaluate candidate splits at a node and its children. However, in the online RF versions (ORF-Saffari and ORF-Denil), incoming training examples are used to evaluate candidate splits just at a current leaf node and new training data are required to evaluate candidate splits every time a new leaf node is created. Saffari et al. [20] recommend multiple passes through the training data to increase the effective number of training samples. In a realistic streaming data setup, where training examples cannot be stored for multiple passes, MF would require significantly fewer examples than ORF-Saffari to achieve the same accuracy.\n\nComparing test accuracy vs training time on usps, satimages and letter datasets, we observe that MF is at least an order of magnitude faster than re-trained batch versions and ORF-Saffari. For ORF-Saffari, we plot test accuracy at the end of every additional pass; hence it contains additional markers compared to the top row which plots results after a single pass. Re-training batch RF using 100 mini-batches is unfair to MF; in a streaming data setup where the model is updated when a new training instance arrives, MF would be significantly faster than the re-trained batch versions.\n\n^3 The scikit-learn implementation uses highly optimized C code, hence we do not compare our runtimes with the scikit-learn implementation.
The ERT implementation in scikit-learn achieves very similar test accuracy to\nour ERT implementation, hence we do not report those results here.\n\nFootnote 4: https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names\n\nAssuming trees are balanced after adding each data point, it can be shown that the computational cost of\nMF scales as O(N log N) whereas that of re-trained batch RF scales as O(N^2 log N) (Appendix C).\nAppendix E shows that the average depth of the forests trained on the above datasets scales as O(log N).\nIt is remarkable that choosing splits independent of labels achieves competitive classification\nperformance. This phenomenon has been observed by others as well; for example, Cutler and Zhao\n[7] demonstrate that their PERT classifier (which is similar to the batch version of MF) achieves test\naccuracy comparable to Breiman-RF on many real world datasets. However, in the presence of\nirrelevant features, methods which choose splits independent of labels (MF, ERT-1) perform worse\nthan Breiman-RF and ERT-k (but still better than ORF-Saffari) as indicated by the results on the\ndna dataset. We trained MF and ERT-1 using just the most relevant 60 attributes amongst the 180\nattributes (see footnote 5); these results are indicated as MF\u2020 and ERT-1\u2020 in Figure 3. 
We observe that, as expected,\nfiltering out irrelevant features significantly improves performance of MF and ERT-1.\n\n[Figure 3 plots omitted. Panels: usps, satimages, letter, dna. Legend: MF, ERT-k, ERT-1, ORF-Saffari, Breiman-RF*, MF\u2020, ERT-1\u2020.]\n\nFigure 3: Results on various datasets: y-axis is test accuracy in both rows. x-axis is fraction of training data\nfor the top row and training time (in seconds) for the bottom row. We used the pre-defined train/test split.\nFor usps dataset D = 256, K = 10, Ntrain = 7291, Ntest = 2007; for satimages dataset D = 36, K =\n6, Ntrain = 3104, Ntest = 2000; letter dataset D = 16, K = 26, Ntrain = 15000, Ntest = 5000; for dna dataset\nD = 180, K = 3, Ntrain = 1400, Ntest = 1186.\n8 Discussion\nWe have introduced Mondrian forests, a novel class of random forests, which can be trained\nincrementally in an efficient manner. 
MF significantly outperforms existing online random forests in\nterms of training time as well as number of training instances required to achieve a particular test\naccuracy. Remarkably, MF achieves test accuracy competitive with batch random forests trained on the\nsame fraction of the data. MF is unable to handle lots of irrelevant features (since splits are chosen\nindependent of the labels); one way to use labels to guide splits is via the recently proposed Sequential\nMonte Carlo algorithm for decision trees [14]. The computational complexity of MF is linear in the\nnumber of dimensions (since rectangles are represented explicitly) which could be expensive for\nhigh dimensional data; we will address this limitation in future work. Random forests have been\ntremendously influential in machine learning for a variety of tasks; hence many other interesting\nextensions of this work are possible, e.g. MF for regression, theoretical bias-variance analysis of MF, and\nextensions of MF that use hyperplane splits instead of axis-aligned splits.\nAcknowledgments\nWe would like to thank Charles Blundell, Gintare Dziugaite, Creighton Heaukulani, Jos\u00e9 Miguel\nHern\u00e1ndez-Lobato, Maria Lomeli, Alex Smola, Heiko Strathmann and Srini Turaga for helpful\ndiscussions and feedback on drafts. BL gratefully acknowledges generous funding from the Gatsby\nCharitable Foundation. This research was carried out in part while DMR held a Research Fellowship\nat Emmanuel College, Cambridge, with funding also from a Newton International Fellowship through\nthe Royal Society. YWT\u2019s research leading to these results was funded in part by the European\nResearch Council under the European Union\u2019s Seventh Framework Programme (FP7/2007-2013)\nERC grant agreement no. 617411.\n\nFootnote 5: https://www.sgi.com/tech/mlc/db/DNA.names\n\nReferences\n[1] L. Breiman. Bagging predictors. Mach. Learn., 24(2):123\u2013140, 1996.\n[2] L. Breiman. Random forests. Mach. 
Learn., 45(1):5\u201332, 2001.\n[3] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proc. Int. Conf. Mach. Learn. (ICML), 2006.\n[4] H. A. Chipman, E. I. George, and R. E. McCulloch. Bayesian CART model search. J. Am. Stat. Assoc., pages 935\u2013948, 1998.\n[5] S. L. Crawford. Extensions to the CART algorithm. Int. J. Man-Machine Stud., 31(2):197\u2013217, 1989.\n[6] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found. Trends Comput. Graphics and Vision, 7(2\u20133):81\u2013227, 2012.\n[7] A. Cutler and G. Zhao. PERT - Perfect Random Tree Ensembles. Comput. Sci. and Stat., 33:490\u2013497, 2001.\n[8] M. Denil, D. Matheson, and N. de Freitas. Consistency of online random forests. In Proc. Int. Conf. Mach. Learn. (ICML), 2013.\n[9] D. G. T. Denison, B. K. Mallick, and A. F. M. Smith. A Bayesian CART algorithm. Biometrika, 85(2):363\u2013377, 1998.\n[10] T. G. Dietterich. Ensemble methods in machine learning. In Multiple classifier systems, pages 1\u201315. Springer, 2000.\n[11] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. 6th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD), pages 71\u201380. ACM, 2000.\n[12] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Mach. Learn., 63(1):3\u201342, 2006.\n[13] J. T. Goodman. A bit of progress in language modeling. Comput. Speech Lang., 15(4):403\u2013434, 2001.\n[14] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Top-down particle filtering for Bayesian decision trees. In Proc. Int. Conf. Mach. Learn. (ICML), 2013.\n[15] T. P. Minka. Bayesian model averaging is not model combination. MIT Media Lab note. http://research.microsoft.com/en-us/um/people/minka/papers/bma.html, 2000.\n[16] F. Pedregosa, G. Varoquaux, A. 
Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res., 12:2825\u20132830, 2011.\n[17] J. Pitman. Combinatorial stochastic processes, volume 32. Springer, 2006.\n[18] D. M. Roy. Computability, inference and modeling in probabilistic programming. PhD thesis, Massachusetts Institute of Technology, 2011. http://danroy.org/papers/Roy-PHD-2011.pdf.\n[19] D. M. Roy and Y. W. Teh. The Mondrian process. In Adv. Neural Inform. Proc. Syst. (NIPS), volume 21, pages 27\u201336, 2009.\n[20] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In Computer Vision Workshops (ICCV Workshops). IEEE, 2009.\n[21] M. A. Taddy, R. B. Gramacy, and N. G. Polson. Dynamic trees for learning and design. J. Am. Stat. Assoc., 106(493):109\u2013123, 2011.\n[22] Y. W. Teh. A hierarchical Bayesian language model based on Pitman\u2013Yor processes. In Proc. 21st Int. Conf. on Comp. Ling. and 44th Ann. Meeting Assoc. Comp. Ling., pages 985\u2013992. Assoc. for Comp. Ling., 2006.\n[23] P. E. Utgoff. Incremental induction of decision trees. Mach. Learn., 4(2):161\u2013186, 1989.\n[24] F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proc. Int. Conf. Mach. Learn. (ICML), 2009.\n", "award": [], "sourceid": 1619, "authors": [{"given_name": "Balaji", "family_name": "Lakshminarayanan", "institution": "Gatsby Unit, University College London"}, {"given_name": "Daniel", "family_name": "Roy", "institution": "University of Toronto"}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford"}]}