{"title": "Online allocation and homogeneous partitioning for piecewise constant mean-approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1961, "page_last": 1969, "abstract": "In the setting of active learning for the multi-armed bandit, where the goal of a learner is to estimate with equal precision the mean of a finite number of arms, recent results show that it is possible to derive strategies based on finite-time confidence bounds that are competitive with the best possible strategy. We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X \\subset \\Real^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of \\alpha-H\u00f6lder functions.", "full_text": "Online allocation and homogeneous partitioning for\n\npiecewise constant mean-approximation\n\nOdalric Ambrym Maillard\nMontanuniversit\u00a8at Leoben\n\nFranz-Josef Strasse 18\nA-8700 Leoben, Austria\n\nAlexandra Carpentier\n\nStatistical Laboratory, CMS\nWilberforce Road, Cambridge\n\nCB3 0WB UK\n\nodalricambrym.maillard@gmail.com\n\na.carpentier@statslab.cam.ac.uk\n\nAbstract\n\nIn the setting of active learning for the multi-armed bandit, where the goal of a\nlearner is to estimate with equal precision the mean of a \ufb01nite number of arms,\nrecent results show that it is possible to derive strategies based on \ufb01nite-time con-\n\ufb01dence bounds that are competitive with the best possible strategy. 
We here consider an extension of this problem to the case when the arms are the cells of a finite partition P of a continuous sampling space X ⊂ R^d. Our goal is now to build a piecewise constant approximation of a noisy function (where each piece is one region of P and P is fixed beforehand) in order to maintain the local quadratic error of approximation on each cell equally low. Although this extension is not trivial, we show that a simple algorithm based on upper confidence bounds can be proved to be adaptive to the function itself in a near-optimal way, when |P| is chosen to be of minimax-optimal order on the class of α-Hölder functions.

1 Setting and Previous work

Let us consider some space X ⊂ R^d, and Y ⊂ R. We call X the input space or sampling space, and Y the output space or value space. We consider the problem of estimating with uniform precision the function f : X ⊂ R^d → Y ⊂ R. We assume that we can query the function f n times, anywhere in the domain, and observe noisy samples of it. These samples are collected sequentially, and our aim is to design an adaptive procedure that selects wisely where on the domain to query the function, according to the information provided by the previous samples. More formally:

Observed process. We consider an unknown Y-valued process defined on X, written ν : X → M+_1(Y), where M+_1(Y) refers to the set of all probability measures on Y, such that for all x ∈ X, the random variable Y(x) ∼ ν(x) has mean f(x) := E[Y(x)|x] ∈ R. We write the model for convenience in the following way

Y(x) = f(x) + noise(x),

where noise(x) := Y(x) − E[Y(x)|x] is the centered random variable corresponding to the noise, with unknown variance σ²(x). 
We assume throughout this paper that f is α-Hölder.

Partition. We consider a partition P of the input space X with finitely many regions {R_p}_{1≤p≤P} that are assumed to be convex and non-degenerate, i.e. such that the interior of each region R_p has positive Lebesgue volume v_p. Moreover, with each region R_p is associated a sampling distribution in that region, written μ_p ∈ M+_1(R_p). Thus, when we decide to sample in region R_p, a new sample X ∈ R_p is generated according to X ∼ μ_p.

Allocation. We consider that we have a finite budget of n ∈ N samples that we can allocate as we wish among the regions {R_p}_{1≤p≤P}. For illustration, let us assume that we deterministically allocate T_{p,n} ∈ N samples to region R_p, with the constraint that the allocation {T_{p,n}}_{1≤p≤P} must sum to n. In region R_p, we thus sample points {X_{p,i}}_{1≤i≤T_{p,n}} at random according to the sampling distribution μ_p, and then get the corresponding values {Y_{p,i}}_{1≤i≤T_{p,n}}, where Y_{p,i} ∼ ν(X_{p,i}). In the sequel, the distribution μ_p is assumed to be the uniform distribution over region R_p, i.e. the density of μ_p is dλ(x) 1{x ∈ R_p} / λ(R_p), where λ denotes the Lebesgue measure. Note that this is not restrictive since we are in an active, not passive, setting.

Piecewise constant mean-approximation. We use the collected samples in order to build a piecewise constant approximation f̂_n of the mean f, and measure the accuracy of approximation on a region R_p with the expected quadratic norm of the approximation error, namely

E[ ∫_{R_p} (f(x) − f̂_n(x))² λ(dx)/λ(R_p) ] = E_{μ_p,ν}[ (f(X) − m̂_{p,n})² ],

where m̂_{p,n} is the constant value that f̂_n takes on the region R_p. 
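To make the sampling model concrete, here is a small numerical sketch; the target f, noise level, dimension d = 1, and grid size below are illustrative assumptions, not taken from the paper. Each cell of a fixed regular partition of [0, 1] is sampled uniformly, the empirical mean on each cell gives a piecewise constant estimate, and the per-cell expected quadratic error is approximated by Monte Carlo integration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed, not from the paper): d = 1, X = [0, 1],
# f(x) = sin(2*pi*x), homoscedastic Gaussian noise of level sigma = 0.1.
f = lambda x: np.sin(2 * np.pi * x)
sigma = 0.1
P = 8                                   # number of cells of the fixed partition
edges = np.linspace(0.0, 1.0, P + 1)    # regular grid: R_p = [edges[p], edges[p+1]]

def sample_region(p, T):
    """Draw T points uniformly in R_p and observe Y(x) = f(x) + noise(x)."""
    X = rng.uniform(edges[p], edges[p + 1], size=T)
    return X, f(X) + sigma * rng.standard_normal(T)

# Deterministic allocation {T_{p,n}} summing to n (uniform here, for simplicity).
n = 4000
T = n // P
m_hat = np.empty(P)                     # constant value of the estimate on each cell
for p in range(P):
    _, Y = sample_region(p, T)
    m_hat[p] = Y.mean()                 # empirical mean on the cell

# Per-cell expected quadratic error E_{mu_p}[(f(X) - m_hat_p)^2],
# approximated by Monte Carlo integration over each cell.
grid = rng.uniform(0.0, 1.0, 100_000)
cell = np.minimum((grid * P).astype(int), P - 1)
per_cell_err = np.array([np.mean((f(grid[cell == p]) - m_hat[p]) ** 2)
                         for p in range(P)])
print(per_cell_err.max())               # the maximal local error over the cells
```

The maximal per-cell error printed at the end is exactly the local quantity that the allocation strategies below aim to keep small.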
A natural choice for the estimator m̂_{p,n} is the empirical mean, which is unbiased and asymptotically optimal for this criterion. Thus we consider the following (histogram) estimate

f̂_n(x) = Σ_{p=1}^{P} m̂_{p,n} I{x ∈ R_p}, where m̂_{p,n} = (1/T_{p,n}) Σ_{i=1}^{T_{p,n}} Y_{p,i}.

Pseudo-loss. Note that, since the T_{p,n} are deterministic, the expected quadratic norm of the approximation error of this estimator can be written in the following form

E_{μ_p,ν}[(f(X) − m̂_{p,n})²] = E_{μ_p,ν}[(f(X) − E_{μ_p}[f(X)])²] + E_{μ_p,ν}[(E_{μ_p}[f(X)] − m̂_{p,n})²]
  = V_{μ_p}[f(X)] + V_{μ_p,ν}[m̂_{p,n}]
  = V_{μ_p}[f(X)] + (1/T_{p,n}) V_{μ_p,ν}[Y(X)].

Now, using the following immediate decomposition

V_{μ_p,ν}[Y(X)] = V_{μ_p}[f(X)] + ∫_{R_p} σ²(x) μ_p(dx),

we deduce that the maximal expected quadratic norm of the approximation error over the regions {R_p}_{1≤p≤P}, which depends on the choice of the considered allocation strategy A := {T_{p,n}}_{1≤p≤P}, is given by the following so-called pseudo-loss

L_n(A) := max_{1≤p≤P} [ ((T_{p,n} + 1)/T_{p,n}) V_{μ_p}[f(X)] + (1/T_{p,n}) E_{μ_p}[σ²(X)] ].   (1)

Our goal is to minimize this pseudo-loss. Note that this is a local measure of performance, as opposed to a more usual yet less challenging global quadratic error. Eventually, as the number of cells tends to ∞, this local measure of performance approaches sup_{x∈X} E_ν[(f(x) − f̂_n(x))²]. At
At\n\nthis point, let us also introduce, for convenience, the notation Qp(Tp,n) that denotes the term inside\nthe max, in order to emphasize the dependency on the quadratic error with the allocation.\nPrevious work\nThere is a huge literature on the topic of functional estimation in batch setting. Since it is a rather\nold and well studied question in statistics, many books have been written on this topic, such as Bosq\nand Lecoutre [1987], Rosenblatt [1991], Gy\u00a8or\ufb01 et al. [2002], where piecewise constant mean-\napproximation are also called \u201cpartitioning estimate\u201d or \u201cregressogram\u201d (\ufb01rst introduced by Tukey\n[1947]). The minimax-optimal rate of approximation on the class of \u03b1-H\u00a8older functions is known\nto be in O(n\u2212 2\u03b1\n2\u03b1+d ) (see e.g. Ibragimov and Hasminski [1981], Stone [1980], Gy\u00a8or\ufb01 et al. [2002]).\nIn such setting, a dataset {(Xi, Yi)}i\u2264n is given to the learner, and a typical question is thus to try\nto \ufb01nd the best possible histogram in order to minimize a approximation error. Thus the dataset is\n\ufb01xed and we typically resort to techniques such as model selection where each model corresponds\nto one histogram (see Arlot [2007] for an extensive study of such).\nHowever, we here ask a very different question, that is how to optimally sample in an online setting\nin order to minimize the approximation error of some histogram. Thus we choose the histogram\n\n2\n\n\fbefore we see any sample, then it is \ufb01xed and we need to decide which cell to sample from at\neach time step. Motivation for this setting comes naturally from some recent works in the setting\nof active learning for the multi-armed bandit problem Antos et al. [2010], Carpentier et al. [2011].\nIn these works, the objective is to estimate with equal precision the mean of a \ufb01nite number of\ndistributions (arms), which would correspond to the special case when X = {1, . . . 
, P} is a finite set in our setting. Intuitively, we reduce the problem to such a bandit problem with a finite set of arms (regions), and our setting answers the question whether it is possible to extend those results to the case when an arm corresponds not to a singleton but to a continuous region. We show that the answer is positive, yet non-trivial. The difficulty lies in the variance estimation in each region: points x in a region may have different means f(x), so that standard estimators of the variance are biased, contrary to the point-wise case, and thus finite-arm techniques may yield disastrous results. (Estimating the variance of the distribution in a continuous region actually needs to take into account not only the point-wise noise but also the variation of the function f and of the noise level σ² in that region.) We describe a way, inspired from quasi Monte-Carlo techniques, to correct this bias so that we can handle the additional error. Also, it is worth mentioning that this setting can be informally linked to a notion of curiosity-driven learning (see Schmidhuber [2010], Baranes and Oudeyer [2009]), since we want to decide in which region of the space to sample, without explicit reward, while optimizing the goal of understanding the unknown environment.

Outline Section 2 provides more intuition about the pseudo-loss and a result about the optimal oracle strategy when the domain is partitioned in a minimax-optimal way on the class of α-Hölder functions. Section 3 presents our assumptions, which are basically a sub-Gaussian noise and smooth mean and variance functions, then our estimator of the pseudo-loss together with its concentration properties, before introducing our sampling procedure, called OAHPA-pcma. 
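The bias issue described above is easy to see numerically. In the sketch below (region, slope, and noise level are assumed values, for demonstration only), the empirical variance of the observations collected in a region concentrates on V_{μ_p}[f(X)] + E_{μ_p}[σ²(X)], not on the noise variance σ² alone, which is precisely the decomposition given in Section 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# One region R_p = [0, 0.5] with f(x) = 2x and constant noise level
# sigma = 0.1 (all values assumed, for demonstration only).
a, b, sigma = 0.0, 0.5, 0.1
T = 200_000
X = rng.uniform(a, b, T)
Y = 2 * X + sigma * rng.standard_normal(T)

# Naive "finite-arm" variance estimate: empirical variance of the Y's.
naive_var = Y.var()

# Decomposition from Section 1: Var(Y(X)) = V_{mu_p}[f(X)] + E_{mu_p}[sigma^2(X)].
var_f = (2 * (b - a)) ** 2 / 12         # variance of 2X for X ~ U[a, b]
print(naive_var)                        # ~ var_f + sigma^2, NOT ~ sigma^2:
print(var_f + sigma ** 2)               # the estimator is biased by V_{mu_p}[f(X)]
```

Here the naive estimate is almost ten times the true noise variance, entirely because of the variation of f inside the region.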
Finally, the performance of this procedure is provided and discussed in Section 4.

2 The pseudo-loss: study and optimal strategies

2.1 More intuition on each term in the pseudo-loss

It is natural to look at what happens to each of the two terms that appear in equation (1) when one makes R_p shrink towards a point. More precisely, let x_p be the mean of X ∼ μ_p and let us look at the limit of V_{μ_p}(f(X)) when v_p goes to 0. Assuming that f is differentiable, we get

lim_{v_p→0} V_{μ_p}(f(X)) = lim_{v_p→0} E_{μ_p}[ (f(X) − f(x_p) − E[f(X) − f(x_p)])² ]
  = lim_{v_p→0} E_{μ_p}[ (⟨X − x_p, ∇f(x_p)⟩ − E[⟨X − x_p, ∇f(x_p)⟩])² ]
  = lim_{v_p→0} E_{μ_p}[ ⟨X − x_p, ∇f(x_p)⟩² ]
  = lim_{v_p→0} ∇f(x_p)ᵀ E_{μ_p}[ (X − x_p)(X − x_p)ᵀ ] ∇f(x_p).

Therefore, if we introduce Σ_p, the covariance matrix of the random variable X ∼ μ_p, then we simply have lim_{v_p→0} V_{μ_p}(f(X)) = lim_{v_p→0} ||∇f(x_p)||²_{Σ_p}.

Example with hyper-cubic regions An important example is when R_p is a hypercube with side length v_p^{1/d} and μ_p is the uniform distribution over the region R_p. In that case (see Lemma 1), we have μ_p(dx) = dx/v_p, and

||∇f(x_p)||²_{Σ_p} = ||∇f(x_p)||² v_p^{2/d} / 12.

More generally, when f is α-differentiable, i.e. 
that for all a ∈ X there exists ∇^α f(a, ·) : S_d(0, 1) → R such that for all x ∈ S_d(0, 1), lim_{h→0} (f(a + hx) − f(a)) / h^α = ∇^α f(a, x), then it is not too difficult to show that for such hyper-cubic regions, we have

V_{μ_p}[f(X)] = O( v_p^{2α/d} sup_{u∈S(0,1)} |∇^α f(x_p, u)|² ).

On the other hand, by direct computation, the second term is such that lim_{v_p→0} E_{μ_p}[σ²(X)] = σ²(x_p). Thus, while V_{μ_p}[f(X)] vanishes, E_{μ_p}[σ²(X)] stays bounded away from 0 (unless ν is deterministic).

2.2 Oracle allocation and homogeneous partitioning for piecewise constant mean-approximation

We now assume that we are allowed to choose the partition P depending on n, thus P = P_n, amongst all homogeneous partitions of the space, i.e. partitions such that all cells have the same volume and come from a regular grid of the space. Thus the only free parameter is the number of cells P_n of the partition.

An exact yet not explicit oracle algorithm. The minimization of the pseudo-loss (1) does not yield a closed-form solution in general. 
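For a fixed partition, however, the minimization reduces to a one-dimensional search. Writing each term of (1) as ((T_p + 1)/T_p) V_p + s_p/T_p = V_p + (V_p + s_p)/T_p, minimizing the max over p under the budget constraint Σ_p T_p = n equalizes the per-cell losses at a common level L, which can be found by bisection. The sketch below is a continuous relaxation with made-up values of V_p = V_{μ_p}[f(X)] and s_p = E_{μ_p}[σ²(X)]; it produces the same water-filling form of allocation as the lemma that follows.

```python
import numpy as np

# Hypothetical per-region quantities (assumed values, for illustration only):
V = np.array([0.30, 0.05, 0.10, 0.02])   # V_{mu_p}[f(X)]
s = np.array([0.20, 0.25, 0.15, 0.40])   # E_{mu_p}[sigma^2(X)]
n = 10_000                               # total budget (T_p relaxed to be real)

# Since ((T_p + 1)/T_p) V_p + s_p/T_p = V_p + (V_p + s_p)/T_p, minimizing the
# max over p under sum_p T_p = n equalizes all per-cell losses at a level L:
#   T_p = (V_p + s_p) / (L - V_p),  with  sum_p (V_p + s_p) / (L - V_p) = n.
# The left-hand side is decreasing in L, so L is found by bisection.
def budget(L):
    return np.sum((V + s) / (L - V))

lo, hi = V.max() + 1e-12, V.max() + 1.0
while budget(hi) > n:                    # grow hi until the budget fits
    hi *= 2
for _ in range(200):                     # bisection on L
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if budget(mid) > n else (lo, mid)
L = hi
T = (V + s) / (L - V)                    # oracle allocation
print(T.sum())                           # ~ n
print(V + (V + s) / T)                   # every cell's loss equals L exactly
```

Note that the equalization V_p + (V_p + s_p)/T_p = L holds exactly by construction: cells with larger variance terms simply receive more samples.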
However, we can still derive the order of the optimal loss (see [Carpentier and Maillard, 2012, Lemma 2] in the full version of the paper for an example of a minimax yet non-adaptive oracle algorithm given in closed form):

Lemma 1 In the case when V_{μ_p}[f(X)] = Ω(P^{−α′}) and ∫_{R_p} σ²(x) μ_p(dx) = Ω(P^{−β′}), an optimal allocation and partitioning strategy A★_n satisfies

P★_n = Ω( n^{1/max(1+α′−β′, 1)} )  and  T★_{p,n} := (V_{μ_p}[f(X)] + E_{μ_p}[σ²(X)]) / (L − V_{μ_p}[f(X)]),

as soon as there exists, for such a range of P★_n, a constant L such that

Σ_{p=1}^{P★_n} (V_{μ_p}[f(X)] + E_{μ_p}[σ²(X)]) / (L − V_{μ_p}[f(X)]) = n.

The pseudo-loss of such an algorithm A★_n, optimal amongst the allocation strategies that use the partition P_n in P★_n regions, is then given by

L_n(A★_n) = Ω(n^γ)  where  γ := max(1 − β′, 1 − α′) / max(1 + α′ − β′, 1) − 1.

The condition involving the constant L is here to ensure that the partition is not degenerate. It is morally satisfied as soon as the variance of f and the noise are bounded and n is large enough.

This Lemma applies to the important class W^{1,2}(R) of functions that admit a weak derivative belonging to L²(R). Indeed these functions are Hölder with coefficient α = 1/2, i.e. we have W^{1,2}(R) ⊂ C^{0,1/2}(R). 
The standard Brownian motion is an example of a function that is 1/2-Hölder. More generally, for k = d/2 + α with α = 1/2 when d is odd and α = 1 when d is even, we have the inclusion

W^{k,2}(R^d) ⊂ C^{0,α}(R^d),

where W^{k,2}(R^d) is the set of functions that admit a k-th weak derivative belonging to L²(R^d). Thus the previous Lemma applies to sufficiently smooth functions, with smoothness increasing linearly with the dimension d of the input space X.

Important remark Note that this Lemma gives us a choice of the partition that is minimax-optimal, and an allocation strategy on that partition that is not only minimax-optimal but also adaptive to the function f itself. Thus it provides a way to decide in a minimax way what is the good number of regions, and then the best oracle way to allocate the budget.

We can deduce the following immediate corollary on the class of α-Hölder functions observed in a non-negligible noise of bounded variance (i.e. in the setting β′ = 0 and α′ = 2α/d).

Corollary 1 Consider that f is α-Hölder and the noise is of bounded variance. Then a minimax-optimal partition satisfies P★_n = Ω(n^{d/(d+2α)}) and an optimal allocation achieves the rate L_n(A★_n) = Ω(n^{−2α/(d+2α)}). Moreover, the strategy of Lemma 1 is optimal amongst the allocation strategies that use the partition P_n in P★_n regions.

The rate Ω(n^{−2α/(d+2α)}) is minimax-optimal on the class of α-Hölder functions (see Györfi et al. [2002], Ibragimov and Hasminski [1981], Stone [1980]), and it is thus interesting to consider an initial number of regions P★_n of order P★_n = Ω(n^{d/(d+2α)}). After having built the partition, if the quantities (V_{μ_p}[f])_{p≤P} and (E_{μ_p}[σ²])_{p≤P} are known to the learner, it is optimal, in the aim of minimizing the pseudo-loss, to allocate to each region the number of samples T★_{p,n} provided in Lemma 1. Our objective in this paper is, after having chosen beforehand a minimax-optimal partition, to allocate the samples properly in the regions, without having any access to those quantities. It is then necessary to balance between exploration, i.e. allocating samples in order to estimate (V_{μ_p}[f])_{p≤P} and (E_{μ_p}[σ²])_{p≤P}, and exploitation, i.e. using the estimates to target the optimal allocation.

3 Online algorithms for allocation and homogeneous partitioning for piecewise constant mean-approximation

In this section, we turn to the design of algorithms that are fully online, with the goal of being competitive against the kind of oracle algorithms considered in Section 2.2. We now assume that the space X = [0, 1]^d is divided into P_n hyper-cubic regions of the same measure (the Lebesgue measure on [0, 1]^d) v_p = v = 1/P_n. The goal of an algorithm is to minimize the quadratic error of approximation of f by a constant over each cell, in expectation, which we write as

max_{1≤p≤P_n} E[ ∫_{R_p} (f(x) − f̂_n(x))² λ(dx)/λ(R_p) ] = max_{1≤p≤P_n} E[ ∫_{R_p} (f(x) − m̂_{p,n})² λ(dx)/λ(R_p) ],

where f̂_n is the histogram estimate of the function f on the partition P and m̂_{p,n} is the empirical mean defined on region R_p with the samples (X_i, Y_i) such that X_i ∈ R_p. 
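Before formalizing, here is a rough sketch of the kind of online loop such an algorithm runs. The confidence bonus below is an illustrative placeholder, not the empirical-Bernstein-based bound that OAHPA-pcma actually uses, and the target f, the noise level σ(·), P_n, and n are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed problem instance (illustration only): d = 1, X = [0, 1],
# f(x) = sin(2*pi*x), heteroscedastic noise level sigma(x) = 0.1 + 0.2 x.
f = lambda x: np.sin(2 * np.pi * x)
sigma = lambda x: 0.1 + 0.2 * x
P, n = 8, 5000
edges = np.linspace(0.0, 1.0, P + 1)

count = np.zeros(P, dtype=int)
obs = [[] for _ in range(P)]             # observations collected in each cell

for t in range(n):
    if (count < 2).any():                # initialization: 2 samples per cell
        p = int(np.argmin(count))
    else:
        # Plug-in estimate of the per-cell error (~ Var/T), inflated by an
        # optimistic bonus; this bonus is an illustrative placeholder, not the
        # empirical-Bernstein bound used by OAHPA-pcma.
        var_hat = np.array([np.var(o, ddof=1) for o in obs])
        ucb = (var_hat + np.sqrt(np.log(n) / count)) / count
        p = int(np.argmax(ucb))          # sample where the error bound is largest
    x = rng.uniform(edges[p], edges[p + 1])
    obs[p].append(f(x) + sigma(x) * rng.standard_normal())
    count[p] += 1

m_hat = np.array([np.mean(o) for o in obs])  # piecewise constant estimate
print(count)                             # more samples where the cell error is larger
```

The optimism principle does the balancing automatically: cells whose error estimate (or uncertainty about it) is large get sampled more, shrinking exactly the terms that dominate the max in the pseudo-loss.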
To do so, an algorithm is only allowed to specify at each time step t the next point X_t where to sample, based on all the past samples {(X_s, Y_s)}_{s<t}. [...]

Theorem 1 Let ε > 0. Then with the choice of the number of regions P_n := n^{d/(2α+d)} ε^{2+d/α} and of the number of sub-regions K := C ε^{−2−d/α} n^{2d/(4α+d)}, where C := 8L2α/(Ad^{1−α}), and with γ := max_p T★_{p,n} / min_p T★_{p,n} the distortion factor of the optimal allocation strategy, the pseudo-loss of the OAHPA-pcma algorithm satisfies, under the assumptions of Section 3.1 and on an event of probability higher than 1 − δ,

L_n(A) ≤ (1 + εγC′ √log(1/δ)) L_n(A★_n) + o(n^{−2α/(2α+d)}),

for some numerical constant C′ not depending on n, where A★_n is the oracle of Lemma 1.

Minimax-optimal partitioning and ε-adaptive performance Theorem 1 provides a high-probability bound on the performance of the OAHPA-pcma allocation strategy. It shows that this performance is competitive with that of an optimal (i.e. adaptive to the function f, see Lemma 1) allocation A★ on a partition whose number of cells P_n is chosen of minimax order n^{d/(2α+d)} for the class of α-Hölder functions. In particular, since L_n(A★_n) = O(n^{−2α/(d+2α)}) on that class, we recover the same minimax order as is obtained in the batch learning setting, for instance with wavelets or kernel estimates (see e.g. Stone [1980], Ibragimov and Hasminski [1981]). Moreover, due to the adaptivity of A★_n to the function itself, this procedure is also ε-adaptive to the function, and not only minimax-optimal on the class, on that partition (see Section 2.2). Naturally, the performance of the method increases, as for any classical functional estimation method, when the smoothness of the function increases. 
Similarly, in agreement with the classical curse of dimensionality, the higher the dimension of the domain, the less efficient the method.

Limitations In this work, we assume that the smoothness α of the function is available to the learner, which enables her to calibrate P_n properly. It would make sense to combine the OAHPA-pcma procedure with existing methods that estimate this smoothness online (under a slightly stronger assumption than Hölder, such as Hölder functions that attain their exponents; see Giné and Nickl [2010]). It is thus interesting, when no preliminary knowledge on the smoothness of f is available, to spend some of the initial budget in order to estimate α.

We have seen that the OAHPA-pcma procedure, although very simple, manages to achieve minimax-optimal results. The downside of the simplicity of the OAHPA-pcma strategy is two-fold.

The first limitation is that the factor (1 + εγC′√log(1/δ)) = (1 + O(ε)) appearing in the bound in front of L_n(A★) is higher than 1. Of course it is generally difficult to get a constant 1 in the batch setting (see Arlot [2007]), and similarly this is a difficult task in our online setting too: if ε is chosen to be small, then the error with respect to the optimal allocation is small. However, since P_n is expressed as an increasing function of ε, this implies that the minimax bound on the loss for partition P also increases with ε. That said, in view of the work on active learning for multi-armed bandits that we extend, we would still prefer to get the optimal constant 1.

The second limitation is more problematic: since K is chosen irrespective of the region R_p, this causes the presence of the factor γ. 
Thus the algorithm will essentially no longer enjoy near-optimal performance guarantees when the optimal allocation strategy is far from homogeneous.

Conclusion and future work In this paper, we considered online regression with histograms in an active setting (we select in which bin to sample), where the histogram can be chosen in a class of homogeneous histograms. Since the (unknown) noise is heteroscedastic and we compete not only with the minimax allocation oracle on α-Hölder functions but with the adaptive oracle that uses a minimax-optimal histogram and allocates samples adaptively to the target function, this is an extremely challenging (and very practical) setting. Our contribution can be seen as a non-trivial extension of the setting of active learning for multi-armed bandits to the case when each arm corresponds to one continuous region of a sampling space, as opposed to a singleton; it can also be seen as a problem of non-parametric function approximation. This new setting offers interesting challenges: we provided a simple procedure, based on the computation of upper confidence bounds on the estimate of the local quadratic error of approximation, and provided a performance analysis showing that OAHPA-pcma is first-order ε-optimal with respect to the function, for a partition chosen to be minimax-optimal on the class of α-Hölder functions. However, this simplicity also has a drawback if one is interested in building an exactly first-order optimal procedure, and going beyond these limitations is definitely not trivial: a more optimal but much more complex algorithm would indeed need to tune a different factor K_p in each cell in an online way, i.e. define some K_{p,t} that evolves with time, and redefine sub-regions accordingly. 
Now, the analysis of OAHPA-pcma already makes use of powerful tools such as empirical-Bernstein bounds for variance estimation (and not only for mean estimation), which makes it non-trivial; in order to handle possibly evolving sub-regions and deal with the progressive refinement of the regions, we would need an even more intricate analysis, due to the fact that we are online and active. This interesting next step is postponed to future work.

Acknowledgements This research was partially supported by the Nord-Pas-de-Calais Regional Council, French ANR EXPLO-RA (ANR-08-COSI-004), and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreements no 270327 (CompLACS) and no 216886 (PASCAL2).

References

András Antos, Varun Grover, and Csaba Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411(29-30):2712–2728, 2010.

Sylvain Arlot. Rééchantillonnage et Sélection de modèles. PhD thesis, Université Paris Sud - Paris XI, 2007.

A. Baranes and P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. IEEE Transactions on Autonomous Mental Development, 1(3):155–169, October 2009.

D. Bosq and J.P. Lecoutre. Théorie de l'estimation fonctionnelle, volume 21. Economica, 1987.

Alexandra Carpentier and Odalric-Ambrym Maillard. Online allocation and homogeneous partitioning for piecewise constant mean-approximation. HAL, 2012. URL http://hal.archives-ouvertes.fr/hal-00742893.

Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Jyrki Kivinen, Csaba Szepesvári, Esko Ukkonen, and Thomas Zeugmann, editors, Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science, pages 189–203. Springer Berlin / Heidelberg, 2011.

E. Giné and R. 
Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2):1122–1170, 2010.

L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Verlag, 2002.

I. Ibragimov and R. Hasminski. Statistical estimation: Asymptotic theory. 1981.

M. Rosenblatt. Stochastic curve estimation, volume 3. Institute of Mathematical Statistics, 1991.

J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). Autonomous Mental Development, IEEE Transactions on, 2(3):230–247, 2010.

C.J. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, pages 1348–1360, 1980.

J.W. Tukey. Non-parametric estimation II. Statistically equivalent blocks and tolerance regions–the continuous case. The Annals of Mathematical Statistics, 18(4):529–539, 1947.
", "award": [], "sourceid": 965, "authors": [{"given_name": "Alexandra", "family_name": "Carpentier", "institution": null}, {"given_name": "Odalric-ambrym", "family_name": "Maillard", "institution": null}]}