{"title": "Regression-tree Tuning in a Streaming Setting", "book": "Advances in Neural Information Processing Systems", "page_first": 1788, "page_last": 1796, "abstract": "We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time $O(\\log n)$ at any time step $n$ while achieving a nearly-optimal regression rate of $\\tilde{O}(n^{-2/(2+d)})$ in terms of the unknown metric dimension $d$. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting.", "full_text": "Regression-tree Tuning in a Streaming Setting\n\nSamory Kpotufe\u2217\n\nToyota Technological Institute at Chicago\u2020\n\nfirstname@ttic.edu\n\nFrancesco Orabona\u2217\n\nToyota Technological Institute at Chicago\n\nfrancesco@orabona.com\n\nAbstract\n\nWe consider the problem of maintaining the data-structures of a partition-based\nregression procedure in a setting where the training data arrives sequentially over\ntime. We prove that it is possible to maintain such a structure in time $O(\\log n)$ at\nany time step $n$ while achieving a nearly-optimal regression rate of $\\tilde{O}(n^{-2/(2+d)})$\nin terms of the unknown metric dimension $d$. Finally we prove a new regression\nlower-bound which is independent of a given data size, and hence is more appropriate\nfor the streaming setting.\n\n1 Introduction\n\nTraditional nonparametric regression such as kernel or k-NN can be expensive to estimate given\nmodern large training data sizes. 
It is therefore common to resort to cheaper methods such as tree-based\nregression which precompute the regression estimates over a partition of the data space [7].\nGiven a future query x, computing the estimate fn(x) simply consists of finding the closest cell of the partition\nby traversing an appropriate tree-structure and returning the precomputed estimate. The partition\nand precomputed estimates depend on the training data and are usually maintained in batch-mode.\nWe are interested in maintaining such a partition and estimates in a real-world setting where the\ntraining data arrives sequentially over time. Our constraints are fast updates at every time\nstep and a near-minimax regression error-rate at any point in time.\nThe error-rate of tree-based regression is well known to depend on the size of the partition\u2019s cells.\nWe will call this size the binwidth. The minimax-optimal binwidth $\\epsilon_n$ is known to be of the form\n$O(n^{-1/(2+d)})$, assuming training data of size n from a metric space of unknown dimension d,\nand an unknown Lipschitz target function f. This setting of $\\epsilon_n$ would then yield a minimax error rate\nof $O(n^{-2/(2+d)})$. Thus, the dimension d is the most important problem variable entering the rate\n(and the tuning of $\\epsilon_n$), while other problem variables such as the Lipschitz properties of f are less\ncrucial in comparison. The main focus of this work is therefore that of adapting to the unknown d\nwhile maintaining fast partition estimates in a streaming setting.\nA first idea would be to start with an initial dimension estimation phase (where the regression estimates\nare suboptimal), and to use the estimated dimension for subsequent data in a following phase,\nwhich leaves only the problem of maintaining partition estimates over time. 
However, while this\nsounds reasonable, it is generally unclear when to confidently stop such an initial phase since this\nwould depend on the unknown d and the distribution of the data.\nOur solution is to interleave dimension estimation with regression updates as the data arrives sequentially.\nHowever the algorithm never relies on the estimated dimensions and views them rather\nas guesses $d_i$. Even if $d_i \\neq d$, it is kept as long as it is not hurtful to regression performance. The\nguess $d_i$ is discarded once we detect that it hurts the regression; a new $d_{i+1}$ is then estimated and a\nnew phase i+1 is started. The decision to discard $d_i$ relies on monitoring quantities that play into the\ntradeoff between regression variance and bias; more precisely we monitor the size of the partition\nand the partition\u2019s binwidth $\\epsilon_n$. We note that the idea can be applied to other forms of regression\nwhere other quantities control the regression variance and bias (see longer version of the paper).\n\n\u2217SK and FO contributed equally to this paper.\n\u2020Other affiliation: Max Planck Institute for Intelligent Systems, Germany\n\n1.1 Technical Overview of Results\n\nWe assume that training data (xi, Yi) is sampled sequentially over time, xi belongs to a general\nmetric space X of unknown dimension d, and Yi is real. The exact setup is given in Section 2.\nThe algorithm (presented in Section 2.3) maintains regression estimates for all training samples\n$x^n \\triangleq \\{x_t\\}_{t=1}^n$ arriving over time, while constantly updating a partition of the data and partition\nbinwidth. 
At any time t = n, all updates are provably of order log n with constants depending on\nthe unknown dimension d of X .\nAt time t = n, the estimate for a query point x is given by the precomputed estimate for the closest\npoint to x in $x^n$, which can be found fast using an off-the-shelf similarity search structure, such as\nthose of [2, 10]. We prove that the L2 error of the algorithm is $\\tilde{O}(n^{-2/(2+d)})$, nearly optimal in\nterms of the unknown dimension d of the metric X .\nFinally, we prove a new lower-bound for regression on a generic metric X , where the worst-case\ndistribution is the same as n increases. Note that traditional lower-bounds for the offline setting\nderive a different worst-case distribution for each sample size n. Thus, our lower-bound is more\nappropriate to the streaming setting where the data arrives over time from the same distribution.\nThe results are discussed in more technical detail in Section 3.\n\n1.2 Related Work\n\nAlthough various interesting heuristics have been proposed for maintaining tree-based learners in\nstreaming settings (see e.g. [1, 5, 11, 15]), the problem has not received much theoretical attention.\nThis is however an important problem given the growing size of modern datasets, and given that in\nmany modern applications, training data is actually acquired sequentially over time and incremental\nupdates have to be efficient (see e.g. Robotics [12, 16], Finance [8]).\nThe most closely related theoretical work is that of [6] which treats the problem of tuning a local\npolynomial regressor where the training data is acquired over time. Their setting however is that\nof a Euclidean space where d is known (ambient Euclidean dimension). [6] is thus concerned with\nmaintaining a minimax error rate w.r.t. 
the known dimension d, while efficiently tuning regression\nbandwidth.\nA possible alternative to the method analyzed here is to employ some form of cross-validation or\neven online solutions based on mixture of experts [3], by keeping track of different partitions, each\ncorresponding to some setting of the binwidth $\\epsilon_n$. This is however likely expensive to maintain in\npractice if good prediction performance is desired.\n\n2 Preliminaries\n\n2.1 Notions of metric dimension\n\nWe consider the following notion of dimension which extends traditional notions of dimension (e.g.\nEuclidean dimension and manifold dimension) to general metric spaces [4]. We assume throughout,\nw.l.o.g. that the space X has diameter at most 1 under a metric \u03c1.\nDefinition 1. The metric measure space (X , \u00b5, \u03c1) has metric measure-dimension d, if there exist\n$\\check{C}_\\mu, \\hat{C}_\\mu$ such that for all $\\epsilon > 0$, and x \u2208 X , $\\check{C}_\\mu \\epsilon^d \\leq \\mu(B(x, \\epsilon)) \\leq \\hat{C}_\\mu \\epsilon^d$.\nThe assumption of finite metric-measure dimension ensures that the measure \u00b5 has mass everywhere\non the space X . This assumption is a generalization (to a metric space) of common assumptions\nwhere the measure has an upper and lower-bounded density on a compact Euclidean space; however, it\nis more general in that it does not require the measure \u00b5 to have a density (relative to any reference\nmeasure). The metric-measure dimension implies the following other notion of metric dimension.\nDefinition 2. The metric (X , \u03c1) has metric dimension d, if there exists $\\hat{C}_\\rho$ such that, for all $\\epsilon > 0$,\nX has an $\\epsilon$-cover of size at most $\\hat{C}_\\rho \\epsilon^{-d}$.\nThe relation between the two notions of dimension is stated in the following lemma of [9], which\nallows us to use either notion as needed.\nLemma 1 ([9]). 
If (X , \u00b5, \u03c1) has metric-measure dimension d, then there exist $\\check{C}_\\rho, \\hat{C}_\\rho$ such that, for\nall $\\epsilon > 0$, any ball B(x, r) centered on (X , \u03c1) has an $\\epsilon r$-cover of size in $[\\check{C}_\\rho \\epsilon^{-d}, \\hat{C}_\\rho \\epsilon^{-d}]$.\n\n2.2 Problem Setup\n\nWe receive data pairs (x1, Y1), (x2, Y2), . . . sequentially, i.i.d. The input xt belongs to a metric\nmeasure space (X , \u03c1, \u00b5) of diameter at most 1, and of metric measure dimension d. The output Yt\nbelongs to a subset of R of bounded diameter $\\Delta_Y$, and satisfies Yt = f (xt) + \u03b7(xt). The noise\n\u03b7(xt) has 0 mean. The unknown function f is assumed to be \u03bb-Lipschitz w.r.t. \u03c1 for an unknown\nparameter \u03bb, that is $\\forall x, x' \\in X$, $|f(x) - f(x')| \\leq \\lambda \\rho(x, x')$.\nL2 error: Our main performance result bounds the excess L2 risk\n\n$$E_{x^n, Y^n} \\|f_n - f\\|_{2,\\mu}^2 \\triangleq E_{x^n, Y^n} E_X |f_n(X) - f(X)|^2.$$\n\nWe will often also be interested in the average error on the training sample: recall that at any time t,\nan estimate ft(xs) of f is produced for every xs \u2208 $x^t$. The average error on $x^n$ at t = n is denoted\n\n$$E_n E_{Y^n} |f_n(X) - f(X)|^2 \\triangleq \\frac{1}{n} \\sum_{t=1}^{n} E_{Y^n} |f_n(x_t) - f(x_t)|^2.$$\n\n2.3 Algorithm\n\nThe procedure (Algorithm 1) works by partitioning the data into small regions of size roughly $\\epsilon_t/2$ at\nany time t, and computing the regression estimate of the centers of each region. All points falling in\nthe same region (identified by a center point) are assigned the same regression estimate: the average\nY values of all points in the region.\nThe procedure works in phases, where each Phase i corresponds to a guess $d_i$ of the metric dimension\nd. 
Where $\\epsilon_t$ might have been set to $t^{-1/(2+d)}$ if we knew d, we set it to $t_i^{-1/(2+d_i)}$\nwhere $t_i$ is the current time step within Phase i.\nWe ensure that in each phase our guess $d_i$ does not hurt the variance-bias tradeoff of the estimates:\nthis is done by monitoring the size of the partition ($|X_i|$ in the algorithm), which controls the variance\n(see analysis in Section 4), relative to the binwidth $\\epsilon_t$, which controls bias. Whenever $|X_i|$ is\ntoo large relative to $\\epsilon_t$, the variance of the procedure is likely too large, so we start a new phase with\na new guess of $d_i$.\nSince the algorithm maintains at any time n an estimate fn(xt) for all xt \u2208 $x^n$, for any query point\nx \u2208 X , we simply return fn(x) = fn(xt) where xt is the closest point to x in $x^n$.\nDespite having to adaptively tune to the unknown d, the main computation at each time step consists\nof just a 2-approximate nearest neighbor search for the closest center. These searches can be\ndone fast in time O (log n) by employing the off-the-shelf online search-procedure of [10]. This is\nemphasized in Lemma 2 below.\nFinally, the algorithm employs a constant $\\hat{C}$ which is assumed to upper-bound the constant $\\hat{C}_\\rho$ in\nDefinition 2. This is a minor assumption since $\\hat{C}_\\rho$ is generally taken to be small, e.g. 1, in machine\nlearning literature, and is exactly quantifiable for various metrics [4, 10].\n\n3 Discussion of Results\n\n3.1 Time complexity\n\nThe time complexity of updates is emphasized in the following Lemma.\nLemma 2. Suppose (X , \u03c1, \u00b5) has metric dimension d. 
Then there exists C depending on d such that\nall computations of the algorithm at any time t = n can be done in time C log n.\n\nAlgorithm 1 Incremental tree-based regressor.\nInitialize: i = 1, $d_i = 1$, $t_i = 0$, Centers $X_i = \\emptyset$\nfor t = 1, 2, . . . , T do\n    Receive (xt, yt)\n    $t_i \\leftarrow t_i + 1$ // counts the time steps within Phase i\n    $\\epsilon_t \\leftarrow t_i^{-1/(2+d_i)}$\n    Set xs \u2208 $X_i$ to the 2-approximate nearest neighbor of xt\n    if $\\rho(x_t, x_s) \\leq \\epsilon_t$ then\n        Assign xt to xs\n        fn(xs) \u2190 update average Y for center xs with yt\n        For every r \u2264 t assigned to xs, fn(xr) = fn(xs)\n    else\n        if $|X_i| + 1 > \\hat{C} 4^{d_i} \\epsilon_t^{-d_i}$ then\n            // Start of Phase i + 1\n            $d_{i+1} \\leftarrow \\lceil \\log((|X_i| + 1)/\\hat{C}) / \\log(4/\\epsilon_t) \\rceil$\n            i \u2190 i + 1\n        end if\n        Add xt as a new center in $X_i$\n    end if\nend for\n\nFigure 1: As $\\epsilon_t$ varies over time, a ball around a center xs can eventually contain\nboth points assigned to xs and points non-assigned to it, and even contain other centers.\nThis results in a complex partitioning of the data.\n\nProof. The main computation at time n consists of finding the 2-approximate nearest neighbor of xn in\n$X_i$ and updating the data structure for the nearest neighbor search. These centers are all at least $\\epsilon_n/2$\napart. 
Using the results of [10], this can be done online in time $O(\\log(1/\\epsilon_n) + \\log\\log(1/\\epsilon_n))$.\n\n3.2 Convergence rates\n\nThe main theorem below bounds the L2 error of the algorithm at any given point in time. The main\ndifficulty is in the fact that the data is partitioned in a complicated way due to the ever-changing\nbinwidth $\\epsilon_t$: every ball around a center can eventually contain both points assigned to the center\nand points not assigned to the center, in fact can contain other centers (see Figure 1). 
This makes\nit hard to get a handle on the number of points assigned to a single center xt (contributing to the\nvariance of fn(xt)) and the distance between points assigned to the same center (contributing to the\nbias). This is not the case in classical analyses of tree-based regression since the data partitioning is\nusually clearly defined.\nThe problem is handled by first looking at the average error over points in $x^n$, which is less difficult.\nTheorem 1. Suppose the space (X , \u00b5, \u03c1) has metric-measure dimension d.\nFor any x \u2208 X , define fn(x) = fn(xt) where xt is the closest point to x in $x^n$. Then at any time\nt = n, we have for some C independent of n,\n\n$$E_{x^n, Y^n} \\|f_n - f\\|_{2,\\mu}^2 \\leq C (d \\log n) \\sup_{x^n} E_n E_{Y^n} |f_n(X) - f(X)|^2 + C \\lambda^2 \\left(\\frac{d \\log n}{n}\\right)^{2/d} + \\frac{\\Delta_Y^2}{n}.$$\n\nIf the algorithm parameter $\\hat{C} \\geq \\hat{C}_\\rho$, then for some constant $C'$ independent of n, we have at any\ntime n that\n\n$$\\sup_{x^n} E_n E_{Y^n} |f_n(X) - f(X)|^2 \\leq C' (\\Delta_Y^2 + \\lambda^2) n^{-2/(2+d)}.$$\n\nThe convergence rate is therefore $\\tilde{O}(n^{-2/(2+d)})$, nearly optimal in terms of the unknown d (up to a\nlog n factor). In the simulation of Figure 2(Left) we compare our procedure to tree-based regressors\nwith a fixed setting of d and of $\\epsilon_t = t^{-1/(2+d)}$. We use the classic rotating-Teapot dataset, where the\ntarget output values are the cosine of the rotation angles. Our method attains the same performance\nas the one with the right fixed setting of d.\nAs alluded to above, the proof of Theorem 1 proceeds by first bounding the average error\n$E_n E_{Y^n} |f_n(X) - f(X)|^2$ on the sample $x^n$. 
Interestingly, the analysis of the average error is\nof a worst-case nature where the data x1, x2, . . . is allowed to arrive adversarially (see analysis of\nSection 4.1). This shows a sense in which the algorithm is robust to bad dimension estimates: the\naverage error is of the optimal form in terms of d, even though the data could trick us into picking a\nbad guess $d_i$ of d. Thus the insights behind the algorithm are perhaps of wider applicability to problems\nof a more adversarial nature. This is shown empirically in Figure 2(Right), where we created a\nsynthetic dataset with d = 5, while the first 1000 samples are from a line in X . An algorithm that\nestimates the dimension in a first phase would end up using the suboptimal setting d = 1, while our\nalgorithm robustly updates its estimate over time.\n\nFigure 2: Simulation results on Teapot (Left) and Synthetic (Right) datasets. $\\hat{C}$ set to 1, size of the\ntest sets 1800 and 12500, respectively.\n\nAs mentioned in the introduction, the same insights can be applied to other forms of regression in\na streaming setting. We show in the longer version of the paper a procedure more akin to kernel\nregression, which employs other quantities (appropriate to the method) to control the bias-variance\ntradeoff while deciding on keeping or rejecting the guess $d_i$.\n\n3.3 Lower-bounds\n\nWe have to produce a distribution for which the problem is hard, and which matches our streaming\nsetting as well as possible. With this in mind, our lower-bound result differs from existing nonparametric\nlower-bounds by combining two important aspects. First, the lower-bound holds for any\ngiven metric measure space (X , \u03c1, \u00b5) with finite measure-dimension: we constrain the worst-case\ndistribution to have the marginal \u00b5 that nature happens to choose. In contrast, lower-bounds in the literature\nwould commonly pick a suitable marginal on the space X [13, 14]. Second, the worst-case\ndistribution does not depend on the sample size as is common in the literature. 
Instead, we show that\nthe rate of $n^{-2/(2+d)}$ holds for infinitely many n for a distribution fixed beforehand. This is more\nappropriate for the online setting where the data is generated over time from a fixed distribution.\nThe lower-bound result of [9] also holds for a given measure space (X , \u00b5, \u03c1), but the worst-case distribution\ndepends on sample size. A lower-bound of [7] holds for infinitely many n, but is restricted\nto distributions on a Euclidean cube, and thus benefits from the regularity of the cube. Our result\ncombines some technical intuition from these two results in a way described in Section 4.3.\nWe need the following definition.\nDefinition 3. Given a metric-measure space (X , \u00b5, \u03c1), we let $D_{\\mu,\\lambda}$ denote the set of distributions on\nX, Y , X \u2208 X , Y \u2208 R, where the marginal on X is \u00b5, and where the function f (x) = E[Y |X = x]\nis \u03bb-Lipschitz w.r.t. \u03c1.\nTheorem 2. Let (X , \u00b5, \u03c1) be a metric space of diameter 1, and metric-measure dimension d. For\nany n \u2208 N, define $r_n^2 = (\\lambda^2 n)^{-2/(2+d)}$. 
Pick any positive sequence $\\{\\beta_n\\}_{n \\in N}$, $\\beta_n = o(\\lambda^2 r_n^2)$.\nThere exists an indexing subsequence $\\{n_t\\}_{t \\in N}$, $n_{t+1} > n_t$, such that\n\n$$\\inf_{\\{f_n\\}} \\sup_{D_{\\mu,\\lambda}} \\lim_{t \\to \\infty} \\frac{E_{X^{n_t}, Y^{n_t}} \\|f_{n_t} - f\\|_{2,\\mu}^2}{\\beta_{n_t}} = \\infty,$$\n\nwhere the infimum is taken over all sequences {fn} of estimators $f_n : X^n, Y^n \\mapsto L_{2,\\mu}$.\n\n[Figure 2 plots. Left panel: Teapot dataset; curves: Incremental Tree-based, fixed d=1, fixed d=4, fixed d=8. Right panel: Synthetic dataset, d=5, D=100, first 1000 samples d=1; curves: Incremental Tree-based, fixed d=1, fixed d=5, fixed d=10. Axes in both panels: # Training samples (horizontal), Normalized RMSE on test set (vertical).]\n\nBy the statement of the theorem, if we pick any rate $\\beta_n$ faster than $n^{-2/(2+d)}$, then there exists a\ndistribution with marginal \u00b5 for which $E \\|f_n - f\\|^2 / \\beta_n$ either diverges or tends to \u221e.\n\n4 Analysis\n\nWe first analyze the average error of the algorithm over the data $x^n$ in Section 4.1. The proof of the\nmain theorem follows in Section 4.2.\n\n4.1 Bounds on Average Error\n\nWe start by bounding the average error on the sample $x^n$ at time n, that is we upper-bound\n$E_n E_{Y^n} |f_n(X) - f(X)|^2$.\nThe proof idea of the upper bound is the following. We bound the error in a given phase (Lemma 4),\nthen combine these errors over all phases to obtain the final bounds (Corollary 1). To bound the\nerror in a phase, we decompose the error in terms of squared bias and variance. The main technical\ndifficulty is that the binwidth $\\epsilon_t$ varies over time and thus points at varying distances are included\nin each estimate. 
Nevertheless, if $n_i$ is the number of steps in Phase i, we will see that both average\nsquared bias and variance can be bounded by roughly $n_i^{-2/(2+d_i)}$.\nFinally, the algorithm ensures that the guess $d_i$ is always an under-estimate of the unknown dimension\nd, as proven in Lemma 3 (proof in the supplemental appendix), so integrating over all phases\nyields an adaptive bound on the average error. We assume throughout this section that the space\n(X , \u03c1) has dimension d for some $\\hat{C}_\\rho$ (see Def. 2).\nLemma 3. Suppose the algorithm parameter $\\hat{C} \\geq \\hat{C}_\\rho$. The following invariants hold throughout\nthe procedure for all phases i \u2265 1 of Algorithm 1:\n\n\u2022 $i \\leq d_i \\leq d$.\n\u2022 For any t \u2208 Phase i we have $|X_i| \\leq \\hat{C} 4^{d_i} \\epsilon_t^{-d_i}$.\n\nLemma 4 (Bound on single phase). Suppose the algorithm parameter $\\hat{C} \\geq \\hat{C}_\\rho$. Consider Phase\ni \u2265 1 and suppose this phase lasts $n_i$ steps. Let $E_{n_i}$ denote expectation relative to the uniform\nchoice of X out of {xt : t \u2208 Phase i}. We have the following bound:\n\n$$E_{n_i} E_{Y^n} |f_n(X) - f(X)|^2 \\leq \\left(\\hat{C} 4^d \\Delta_Y^2 + 12 \\lambda^2\\right) n_i^{-2/(2+d)}.$$\n\nProof. Let $X_i(X)$ denote the center closest to X in $X_i$. Suppose $X_i(X) = x_s$, s \u2208 [n], we let $n_{x_s}$\ndenote the number of points assigned to the center xs. We use the notation $x_t \\to x_s$ to say that xt is\nassigned to center xs.\nFirst fix X \u2208 {xt : t \u2208 Phase i} and let $x_s = X_i(X)$. Define $\\tilde{f}_n(X) \\equiv E_{Y^n} f_n(X) = \\frac{1}{n_{x_s}} \\sum_{x_t \\to x_s} f(x_t)$.\nWe proceed with the following standard bias-variance decomposition\n\n$$E_{Y^n} |f_n(X) - f(X)|^2 = E_{Y^n} |f_n(X) - \\tilde{f}_n(X)|^2 + |\\tilde{f}_n(X) - f(X)|^2. \\quad (1)$$\n\nLet X = xr, r \u2265 s. We first bound the bias term. 
Using the Lipschitz property of f, and Jensen\u2019s\ninequality, we have\n\n$$|\\tilde{f}_n(X) - f(X)|^2 \\leq \\left(\\frac{1}{n_{x_s}} \\sum_{x_t \\to x_s} \\lambda \\rho(x_r, x_t)\\right)^2 \\leq \\frac{1}{n_{x_s}} \\sum_{x_t \\to x_s} \\lambda^2 \\rho(x_r, x_t)^2$$\n$$\\leq \\frac{2\\lambda^2}{n_{x_s}} \\sum_{x_t \\to x_s} \\left(\\rho(x_r, x_s)^2 + \\rho(x_s, x_t)^2\\right) \\leq \\frac{2\\lambda^2}{n_{x_s}} \\sum_{x_t \\to x_s} \\left(\\epsilon_r^2 + \\epsilon_t^2\\right).$$\n\nThe variance term is easily bounded as follows:\n\n$$E_{Y^n} |f_n(X) - \\tilde{f}_n(X)|^2 = \\sum_{x_t \\to x_s} \\frac{E_{Y^n} |Y_t - f(x_t)|^2}{n_{x_s}^2} \\leq \\frac{\\Delta_Y^2}{n_{x_s}}.$$\n\nNow take the expectation over X \u223c U {xt : t \u2208 Phase i}. We have:\n\n$$E_{n_i} E_{Y^n} |f_n(X) - f(X)|^2 = \\sum_{x_s \\in X_i} E_{n_i} E_{Y^n} |f_n(X) - f(X)|^2 \\cdot 1\\{X \\to x_s\\}$$\n$$\\leq \\frac{1}{n_i} \\sum_{x_s \\in X_i} \\sum_{x_r \\to x_s} \\left(\\frac{\\Delta_Y^2}{n_{x_s}} + \\frac{2\\lambda^2}{n_{x_s}} \\sum_{x_t \\to x_s} \\left(\\epsilon_r^2 + \\epsilon_t^2\\right)\\right)$$\n$$= \\frac{1}{n_i} \\sum_{x_s \\in X_i} \\Delta_Y^2 + \\frac{2\\lambda^2}{n_i} \\sum_{x_s \\in X_i} \\frac{1}{n_{x_s}} \\sum_{x_r \\to x_s} \\sum_{x_t \\to x_s} \\left(\\epsilon_r^2 + \\epsilon_t^2\\right)$$\n$$= \\frac{\\Delta_Y^2 \\cdot |X_i|}{n_i} + \\frac{4\\lambda^2}{n_i} \\sum_{x_s \\in X_i} \\sum_{x_t \\to x_s} \\epsilon_t^2 = \\frac{\\Delta_Y^2 \\cdot |X_i|}{n_i} + \\frac{4\\lambda^2}{n_i} \\sum_{t \\in \\text{Phase } i} \\epsilon_t^2.$$\n\nTo bound the last term, we have\n\n$$\\sum_{t \\in \\text{Phase } i} \\epsilon_t^2 = \\sum_{t_i \\in [n_i]} t_i^{-2/(2+d_i)} \\leq \\int_0^{n_i} \\tau^{-2/(2+d_i)} \\, d\\tau \\leq 3 n_i^{1 - 2/(2+d_i)}.$$\n\nCombine with the 
previous derivation and with both statements of Lemma 3 to get\n\n$$E_{n_i} E_{Y^n} |f_n(X) - f(X)|^2 \\leq \\frac{\\Delta_Y^2 \\cdot |X_i|}{n_i} + 12 \\lambda^2 n_i^{-2/(2+d_i)} \\leq \\left(\\hat{C} 4^d \\Delta_Y^2 + 12 \\lambda^2\\right) n_i^{-2/(2+d)}.$$\n\nCorollary 1 (Combined phases). Suppose the algorithm parameter $\\hat{C} \\geq \\hat{C}_\\rho$, then we have\n\n$$E_n E_{Y^n} |f_n(X) - f(X)|^2 \\leq 2 \\left(\\hat{C} 4^d \\Delta_Y^2 + 12 \\lambda^2\\right) n^{-2/(2+d)}.$$\n\nProof. Let I denote the number of phases up to time n. We will decompose the expectation $E_n$ in\nterms of the various phases i \u2208 [I] and apply Lemma 4. Let $B \\triangleq \\hat{C} 4^d \\Delta_Y^2 + 12 \\lambda^2$. We have:\n\n$$E_n E_{Y^n} |f_n(X) - f(X)|^2 \\leq B \\sum_{i=1}^{I} \\frac{n_i}{n} n_i^{-2/(2+d)} = B \\frac{I}{n} \\sum_{i=1}^{I} \\frac{1}{I} n_i^{d/(2+d)} \\leq B \\frac{I}{n} \\left(\\sum_{i=1}^{I} \\frac{n_i}{I}\\right)^{d/(2+d)}$$\n$$= B \\frac{I}{n} \\left(\\frac{n}{I}\\right)^{d/(2+d)} = B \\cdot I^{2/(2+d)} n^{-2/(2+d)} \\leq B \\cdot d^{2/(2+d)} n^{-2/(2+d)},$$\n\nwhere in the second inequality we use Jensen\u2019s inequality, and in the last inequality Lemma 3.\n\n4.2 Bound on L2 Error\n\nWe need the following lemma, whose proof is in the supplemental appendix, which bounds the\nprobability that a \u03c1-ball of a given radius contains a sample from $x^n$. This will then allow us to\nbound the bias induced by transforming a solution for the adversarial setting to a solution for the\nstochastic setting.\nLemma 5. Suppose (X , \u03c1, \u00b5) has metric measure dimension d. Let \u00b5 be a distribution on X and\nlet $\\mu_n$ denote an empirical distribution on an i.i.d sample $x^n$ from \u00b5. For $\\epsilon > 1/n$, let $B_\\epsilon$ denote the\nclass of \u03c1-balls centered on X of radius $\\epsilon$. There exists C depending on d such that the following\nholds. Let 0 < \u03b4 < 1. Define $\\alpha_{n,\\delta} = C (d \\log n + \\log(1/\\delta))$. 
Then, with probability at least 1 \u2212 \u03b4,\nfor all $B \\in B_\\epsilon$ satisfying $\\mu(B) \\geq \\alpha_{n,\\delta}/n$ we have $\\mu_n(B) > 1/n$.\nWe are now ready to prove Theorem 1.\nProof of Theorem 1. Fix \u03b4 = 1/n and define $\\alpha_{n,\\delta}$ as in Lemma 5. Pick $\\epsilon = (\\alpha_{n,\\delta}/C_1 n)^{1/d} \\geq 1/n$,\nwhere $C_1$ is such that every $B \\in B_\\epsilon$ has mass at least $C_1 \\epsilon^d$. Since for every $B \\in B_\\epsilon$, $\\mu(B) \\geq\nC_1 \\epsilon^d \\geq \\alpha_{n,\\delta}/n$, we have by Lemma 5, that with probability at least 1 \u2212 \u03b4, all $B \\in B_\\epsilon$ contain a\npoint from $x^n$. In other words, the event E that $x^n$ forms an $\\epsilon$-cover of X is (1 \u2212 \u03b4)-likely.\nSuppose xt is the closest point in $x^n$ to x \u2208 X . We write $x \\to x_t$. Then, under E, we have\n$|f(x) - f(x_t)| \\leq \\lambda \\epsilon$. We therefore have by Fubini\u2019s theorem\n\n$$E_{x^n, Y^n} \\|f_n - f\\|_{2,\\mu}^2 = E_{x^n} E_X E_{Y^n | x^n} |f_n(X) - f(X)|^2 \\cdot (1\\{E\\} + 1\\{\\bar{E}\\})$$\n$$\\leq E_{x^n} \\sum_{t=1}^{n} 2 \\mu(x : x \\to x_t) E_{Y^n | x^n} |f_n(x_t) - f(x_t)|^2 + 2 \\lambda^2 \\epsilon^2 + \\delta \\Delta_Y^2$$\n$$\\leq E_{x^n} \\sum_{t=1}^{n} 2 C_2 \\epsilon^d E_{Y^n | x^n} |f_n(x_t) - f(x_t)|^2 + 2 \\lambda^2 \\epsilon^2 + \\delta \\Delta_Y^2$$\n$$\\leq \\frac{2 C_2 \\alpha_{n,\\delta}}{C_1} \\sup_{x^n} E_n E_{Y^n} |f_n(x_t) - f(x_t)|^2 + 2 \\lambda^2 \\epsilon^2 + \\delta \\Delta_Y^2,$$\n\nwhere in the first inequality we break the integration over the Voronoi partition of X defined by the\npoints in $x^n$, and introduce f (xt); the second inequality uses $\\{x : x \\to x_t\\} \\subset B(x_t, \\epsilon)$ under E.\n\n4.3 Lower-bound\n\nLet\u2019s consider first the case of a fixed n. The idea behind the proof is as follows: for \u00b5 fixed, we\nhave to come up with a class F of functions which vary considerably on the space X . 
To this end we\ndiscretize X into as many cells as possible, and let any f \u2208 F potentially change sign from one cell\nto the other. The larger the dimension d the more we can discretize the space and the more complex\nF, subject to a Lipschitz constraint. The problem of picking the right f can thus be reduced to that\nof classification, since the learner has to discover the sign of f on sufficiently many cells.\nIn order to handle many data sizes n simultaneously, we borrow from the idea above.\nSay we want to show that the lower-bound holds for a subsequence $\\{n_i\\}$ simultaneously.\nThen we reserve a subset of the space X for each $n_1, n_2, \\ldots$, and discretize each subset\naccording to $n_i$. The functions in F have to then vary sufficiently in each subset\nof the space X according to the corresponding $n_i$. This is illustrated in Figure 3.\nWe can then apply the same idea of reduction to classification\nfor each $n_t$ separately. This sort of idea appears\nin [7] where \u00b5 is uniform on the Euclidean cube, where\nthey use the regularity of the cube to set up the right sequence\nof discretizations over subsets of the cube. The\nmain technicality in our result is that we work with a general\nspace without much regularity. The lack of regularity\nmakes it unclear a priori how to divide such a space into\nsubsets of the proper size for each $n_i$.\nLast, we have to ensure that the functions f \u2208 F resulting\nfrom our discretization of a general metric space X are in fact Lipschitz. For this, we extend some\nof the ideas from [9] which handles the case of a fixed n. For lack of space, the complete proof is in\nthe extended version of the paper.\n\nFigure 3: Lower bound proof idea.\n\n5 Conclusions\n\nWe presented an efficient and nearly minimax optimal approach to nonparametric regression in a\nstreaming setting. 
The streaming setting is gaining more attention as modern data sizes are getting\nlarger, and as data is being acquired online in many applications.\nThe main insights behind the approach presented extend to other nonparametric methods, and are\nlikely to extend to settings of a more adversarial nature. We left open the question of optimal adaptation\nto the smoothness of the unknown function, while we efficiently solve the equally or more\nimportant question of adapting to the unknown dimension of the data, which generally has a\nstronger effect on the convergence rate.\n\nReferences\n[1] Y. Ben-Haim and E. Tom-Tov. A streaming parallel decision tree algorithm. Journal of Machine\nLearning Research, 11:849\u2013872, 2010.\n[2] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbors. ICML, 2006.\n[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University\nPress, New York, NY, USA, 2006.\n[4] K. Clarkson. Nearest-neighbor searching and metric space dimensions. Nearest-Neighbor\nMethods for Learning and Vision: Theory and Practice, 2005.\n[5] P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the 6th ACM\nSIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71\u201380,\n2000.\n[6] H. Gu and J. Lafferty. Sequential nonparametric regression. ICML, 2012.\n[7] L. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution Free Theory of Nonparametric\nRegression. Springer, New York, NY, 2002.\n[8] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. Journal of Machine\nLearning Research, 3:423\u2013440, 2002.\n[9] S. Kpotufe. k-NN Regression Adapts to Local Intrinsic Dimension. NIPS, 2011.\n[10] R. Krauthgamer and J. R. Lee. Navigating nets: simple algorithms for proximity search. 
In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, SODA \u201904,\npages 798\u2013807, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.\n[11] B. Pfahringer, G. Holmes, and R. Kirkby. Handling numeric attributes in hoeffding trees. In\nAdvances in Knowledge Discovery and Data Mining: Proceedings of the 12th Pacific-Asia\nConference (PAKDD), volume 5012, pages 296\u2013307. Springer, 2008.\n[12] S. Schaal and C. Atkeson. Robot Juggling: An Implementation of Memory-based Learning.\nControl Systems Magazine, IEEE, 1994.\n[13] C. J. Stone. Optimal rates of convergence for non-parametric estimators. Ann. Statist., 8:1348\u20131360,\n1980.\n[14] C. J. Stone. Optimal global rates of convergence for non-parametric estimators. Ann. Statist.,\n10:1340\u20131353, 1982.\n[15] M. A. Taddy, R. B. Gramacy, and N. G. Polson. Dynamic trees for learning and design. Journal\nof the American Statistical Association, 106(493), 2011.\n[16] S. Vijayakumar and S. Schaal. Locally weighted projection regression: An O(n) algorithm for\nincremental real time learning in high dimensional space. In Proceedings of the Seventeenth\nInternational Conference on Machine Learning (ICML), pages 1079\u20131086, 2000.\n", "award": [], "sourceid": 912, "authors": [{"given_name": "Samory", "family_name": "Kpotufe", "institution": "TTI Chicago"}, {"given_name": "Francesco", "family_name": "Orabona", "institution": "TTI Chicago"}]}