{"title": "Nonparametric Contextual Bandits in Metric Spaces with Unknown Metric", "book": "Advances in Neural Information Processing Systems", "page_first": 14684, "page_last": 14694, "abstract": "Consider a nonparametric contextual multi-arm bandit problem where each arm $a \\in [K]$ is associated to a nonparametric reward function $f_a: [0,1] \\to \\mathbb{R}$ mapping from contexts to the expected reward. Suppose that there is a large set of arms, yet there is a simple but unknown structure amongst the arm reward functions, e.g. finite types or smooth with respect to an unknown metric space. We present a novel algorithm which learns data-driven similarities amongst the arms, in order to implement adaptive partitioning of the context-arm space for more efficient learning. We provide regret bounds along with simulations that highlight the algorithm's dependence on the local geometry of the reward functions.", "full_text": "Nonparametric Contextual Bandits\n\nin an Unknown Metric Space\n\nNirandika Wanigasekara\n\nComputer Science\n\nNational University of Singapore\nnirandiw@comp.nus.edu.sg\n\nOperations Research and Information Engineering\n\nChristina Lee Yu\n\nCornell University\n\ncleeyu@cornell.edu\n\nAbstract\n\nConsider a nonparametric contextual multi-arm bandit problem where each arm\na \u2208 [K] is associated to a nonparametric reward function fa : [0, 1] \u2192 R mapping\nfrom contexts to the expected reward. Suppose that there is a large set of arms,\nyet there is a simple but unknown structure amongst the arm reward functions,\ne.g. \ufb01nite types or smooth with respect to an unknown metric space. We present a\nnovel algorithm which learns data-driven similarities amongst the arms, in order\nto implement adaptive partitioning of the context-arm space for more ef\ufb01cient\nlearning. 
We provide regret bounds along with simulations that highlight the\nalgorithm\u2019s dependence on the local geometry of the reward functions.\n\n1 Introduction\n\nContextual multi-arm bandits have been used to model the task of sequential decision making in which the rewards of different decisions must be learned over time via trial-and-error. The decision maker receives reward for each of the arms (i.e. actions or options) she chooses across the time horizon T . In each trial t, the decision maker observes the context xt, which represents the set of observable factors of the environment that could impact the performance of the action she chooses. The decision maker must select an action based on the context and all past observations. Upon choosing action a \u2208 [K], she observes a reward, which is assumed to be a stochastic observation of fa(x), the expected reward of action a at context x. In each trial, she faces the dilemma of whether to choose an action in order to learn about its performance (i.e. exploration), or to choose an action that she believes will perform well as estimated from the limited previous data (i.e. exploitation).\nConsider a setting when the number of actions is very large, e.g. there is a large number of users and products on an e-commerce platform such that fully exploring the entire space of possible recommendations is costly. It is often the case that there is additional structure amongst the large space of actions, which the algorithm could exploit to learn more efficiently. In real-world applications however, this additional structure is often unknown a priori and must be learned from the data, which itself could be costly as well. It becomes important to understand the tradeoff and costs of learning relationships amongst the arms from data over the course of the contextual bandit time horizon. 
We\nconsider a stochastic nonparametric contextual bandit setting in which the algorithm is not given any information a priori about the relationship between the actions. The key question is: Can an algorithm exploit hidden structure in a nonparametric contextual bandit problem with no a priori knowledge of the underlying metric?\n\nContributions To our knowledge, we propose the first nonparametric contextual multi-arm bandit algorithm that incorporates latent arm similarities in a setting where no a priori information about the features or metric amongst the arms is given to the algorithm. The algorithm can learn more efficiently by sharing data across similar arms, but the tradeoff between the cost of estimating arm similarities must be carefully accounted for.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOur algorithm builds upon Slivkins\u2019 Zooming algorithm [22], adaptively partitioning the context-arm space using pairwise arm similarities estimated from the data. The adaptive partitioning allows the algorithm to naturally adapt the precision of its estimates around regions of the context-arm space that are nearly optimal, enabling the algorithm to more efficiently allocate its observations to regions of high reward.\nWe provide upper bounds on the regret that show the algorithm\u2019s dependence on the local geometry of the reward functions. If we let f\u2217(x) := max_{a\u2208[K]} fa(x) denote the optimal reward at context x, then the regret depends on how the mass of the set {(a, x) : f\u2217(x) \u2212 fa(x) \u2208 (0, \u03b4]} scales as \u03b4 goes to zero. This set represents the \u03b4-optimal region of the context-arm space except for the exactly optimal arms, i.e. 
the local measure of nearly optimal options centered around the optimal policy. The scaling of this set captures the notion of \u201cgap\u201d used in classical multi-arm bandit problems, but in the general contextual bandit setting with a large number of arms, it may be reasonable that the second optimal arm is very close in value to the optimal arm such that the gap is always very small. Instead the relevant quantity is the relative measure of arms that are \u03b4-optimal yet not optimal, i.e. have gap less than \u03b4. If the mass of such arms decreases linearly with respect to \u03b4, then we show that our algorithm achieves regret of O(\u221a(KT)).\nAn interesting property of our algorithm is that it is fully data-dependent and thus does not depend on the chosen representation of the arm space. The arm similarities (or distances) are measured from data collected by the algorithm itself, and thus approximate a notion of distance that is defined with respect to the reward functions {fa}a\u2208[K]. The algorithm would perform the same for any permutation of the arms. In contrast, consider existing algorithms which assume a given distance metric or kernel function which the reward function is assumed to be smooth with respect to. Those algorithms are sensitive to the metric or kernel given to them, which itself could be expensive to learn or approximate from data. Suppose that nature applied a measure preserving transformation to the arm metric space such that the function is still Lipschitz but has a significantly larger Lipschitz constant. For example, consider a periodic function that repeats across the arm metric space. 
The performance of existing algorithms would degrade with poorer arm feature representations, whereas the algorithm we propose would remain agnostic to such manipulations.\nWe provide simulations that compare our algorithm to oracle variants that have special knowledge of the arms and a naive benchmark that learns over each arm separately. Initially our algorithm has a high cost due to learning the similarities, but for settings with a large number of arms and a long time horizon, the learned similarities pay off and improve the algorithm\u2019s long run performance.\n\nRelated Work As there is a vast literature on multi-arm bandits, we specifically focus on literature related to the stochastic contextual bandit problem, with an emphasis on nonparametric models. In contextual bandits, in each trial the learner first observes a feature vector, referred to as \u201ccontext\u201d, associated with each arm. The optimal reward is measured with respect to the context revealed at the beginning of each trial. One approach is to directly optimize and learn over a given space of policies rather than learn the reward functions [3, 5, 12, 14]. These methods do not require strict assumptions on the reward functions but instead depend on the complexity or size of the model class.\nWe focus on the alternative approach of approximating reward functions, which then depends on assumptions about the structure of the reward function. 
A common assumption to make is that the reward function is linear with respect to the observed context vector [15, 1, 2, 13], such that the reward estimation task reduces to learning coefficient vectors of the reward functions. [2] incorporates sparsity assumptions for the high dimensional covariate setting, and [13] imposes low rank assumptions on the coefficient vectors to reduce the effective dimension.\nIn the linear bandit setting with K arms but only \u0398 arm types for \u0398 \u226a K, Gentile et al proposed an adaptive clustering algorithm which maintains an undirected graph between the arms and progressively erases edges when the distance between the estimated coefficient vectors of a pair of arms is above a set threshold [6]. Two arms of the same type are assumed to have the same coefficient vector. The threshold is chosen as a function of the minimum separation condition between coefficient vectors of different types, such that eventually the graph converges to \u0398 connected components corresponding to the \u0398 types. Collaborative filtering bandits [16] applies the same adaptive clustering concept to the recommendation system setting where both users and item types must be learned.\nIn the nonparametric setting, instead of fixing a parametric model class such as linear, most work imposes smoothness conditions on the reward functions, and subsequently uses nonparametric estimators such as histogram binning, k nearest neighbor, or kernel methods to learn the reward functions [24, 20, 18, 19, 7]. As the contexts are observed, the estimator is applied to learn the reward of each arm separately, essentially assuming the number of arms is not too large. [7] provides an upper bound on regret of \u02dcO(KT^{(d+1)/(d+2)}), where d is the dimension of the context space, and K is the number of arms.\nThe setting of continuum arm bandits has been introduced to approximate settings with very large action spaces. 
As there are infinitely many arms, it is common to impose smoothness with respect to some metric amongst the arms [17, 22, 8, 10, 9]. As the joint context-arm metric is known, these methods apply various smoothing techniques implemented via averaging datapoints with respect to a partitioning of the context-arm space, refining the smoothing parameter as more data is collected. [7] uses a k nearest neighbor estimator using the joint context-arm metric. The contextual zooming algorithm from Slivkins [22] was a key inspiration for our proposed algorithm; it uses the given context-arm metric to adaptively partition the context-arm product space [22]. This enables the algorithm to efficiently allocate more observations to regions of the context-arm space that are near optimal. When T is the time horizon and d is the covering dimension of the context-arm product space, the regret of the contextual zooming algorithm is bounded above by \u02dcO(T^{(d+1)/(d+2)}).\nFor settings with a large but finite number of arms, there are nonparametric models which assume different types of information are known about the relationship amongst the arms. Gaussian process bandits use a known covariance matrix to fit a Gaussian process over the joint context-arm space [11]. Taxonomy MAB assumes that similarity structure amongst the arms is given in terms of a hierarchical tree rather than a metric [21]. Deshmukh et al assume that the kernel matrix between pairs of arms is known, and they subsequently use kernel methods to estimate the reward functions. Cesa-Bianchi et al assume that a graph reflecting arm similarities is given to the algorithm, and their algorithm subsequently uses the Laplacian matrix of the given graph to regularize their estimates of the reward functions [4]. Wu et al assume an influence matrix amongst arms is known and use it to share datapoints among connected arms in the estimation [23].
A limitation of these approaches is that they assume similarity information is provided to the algorithm either as a metric, kernel, or via a graph structure. In real-world applications, this similarity information is often not readily available and must itself be learned from the data.\n\n2 Problem Statement\nAssume that the context at each trial t \u2208 [T ] is sampled independently and uniformly over the unit interval, xt \u223c U (0, 1), and revealed to the algorithm. Assume there are K arms, or options, that the algorithm can choose amongst at each trial t. If the algorithm chooses arm at at trial t, it observes and receives a reward \u03c0t \u2208 R according to \u03c0t = fat(xt) + \u03b5t, where \u03b5t \u223c N (0, \u03c32) is an i.i.d. Gaussian noise term with mean 0 and variance \u03c32, and fa(x) denotes the expected reward for arm a as a function of the context x. We assume that each arm reward function fa : [0, 1] \u2192 [0, 1] is L-Lipschitz, i.e. for all x, x\u2032 \u2208 [0, 1], |fa(x) \u2212 fa(x\u2032)| \u2264 L|x \u2212 x\u2032|.\nThe goal of our problem setting is to maximize the total expected payoff \\sum_{t=1}^{T} \u03c0t over the time horizon T . We provide upper bounds on the expected contextual regret,\n\nE [R(T )] := E[ \\sum_{t=1}^{T} (f\u2217(xt) \u2212 fat(xt)) ], where f\u2217(x) := max_{a\u2208[K]} fa(x). (1)\n\nWe would like to understand whether an algorithm can efficiently exploit latent structure amongst the arm reward functions if it exists. Although the number of arms may be large, they could be drawn from a smaller set of finite arm types. 
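To make the problem setup concrete, here is a minimal Python sketch (ours, not from the paper) of the environment and the contextual regret in Eq. (1); the triangular reward family in `make_rewards` is a hypothetical stand-in for the unknown Lipschitz functions {fa}:

```python
import numpy as np

def make_rewards(K):
    """Hypothetical family of 1-Lipschitz reward functions f_a: [0,1] -> [0,1]
    (triangular peaks; an illustrative stand-in for the paper's unknown {f_a})."""
    thetas = np.linspace(0.0, 1.0, K)
    return [lambda x, th=th: 1.0 - abs(x - th) for th in thetas]

def play(fs, policy, T, sigma=0.01, seed=0):
    """Run one episode; return the realized contextual (pseudo-)regret,
    i.e. the sum of f*(x_t) - f_{a_t}(x_t) as in Eq. (1)."""
    rng = np.random.default_rng(seed)
    regret = 0.0
    for _ in range(T):
        x = rng.uniform()                          # context x_t ~ U(0,1)
        a = policy(x)                              # chosen arm a_t
        _pi = fs[a](x) + rng.normal(0.0, sigma)    # observed noisy reward pi_t
        regret += max(f(x) for f in fs) - fs[a](x)
    return regret
```

An oracle policy that plays argmax_a fa(x) incurs zero regret here, while any fixed arm accumulates positive regret.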
Alternatively the arms could be drawn from a continuum metric space such that the reward function is jointly Lipschitz over the context-arm space; however our algorithm would not have access to or knowledge of the underlying representation in the metric space.\n\n3 Algorithm Intuition\n\nWe begin by describing an oracle algorithm that is given special knowledge of the relationship between the arms in the form of the context-arm metric. Assume that the arms are embedded into a metric space, and the function is Lipschitz with respect to that metric. The contextual zooming algorithm proposed by Slivkins in [22] reduces the large continuum arm set to the effective dimension of the underlying metric space. Essentially, their model assumes that each arm is associated to some known parameter \u03b8a \u2208 [0, 1], and that the expected joint payoff function is 1-Lipschitz continuous in the context-arm product space with respect to a known metric D, such that for all context-arm pairs (x, a) and (x\u2032, a\u2032), |f (x, \u03b8a) \u2212 f (x\u2032, \u03b8a\u2032)| \u2264 D((x, \u03b8a), (x\u2032, \u03b8a\u2032)).\nThe key idea of Slivkins\u2019 zooming algorithm is to use adaptive discretization to encourage the algorithm to obtain more refined estimates in the nearly optimal regions of the space while allowing coarse estimates in suboptimal regions of the context-arm space. The algorithm maintains a partition of the context-arm space consisting of \u201cballs\u201d, or sets, of various sizes. The algorithm estimates the reward function within a ball by averaging observed samples that lie within this ball. An upper confidence bound is obtained by accounting for the bias (proportional to the \u201cdiameter\u201d of the ball due to Lipschitzness) and the variance due to averaging noisy observations within the ball. 
When a\ncontext arrives, the UCB rule is used to select a ball in the partition, and subsequently an arm in that\nball. When the number of observations in a ball increases beyond the threshold such that the variance\nof the estimate is less than the bias, then the algorithm splits the ball into smaller balls, re\ufb01ning the\npartition locally in this region of the context-arm space.\nThe main intuition of the analysis is that the UCB selection rule guarantees (with high probability)\nthat when a ball with diameter \u2206 is selected, the regret incurred by selecting this ball is bounded\nabove by order \u2206. As a result, this algorithm is able to exploit the arm similarities via the joint metric\nin order to aggregate samples of similar arms such that the estimates will converge more quickly.\nSubsequently the algorithm re\ufb01nes the estimates and subpartitions the space as needed for regions\nthat are near optimal and thus require tighter estimates in order to allow the algorithm to narrow in on\nthe optimal arm. The limitation of the previous Zooming algorithm is that it depends crucially on the\ngiven knowledge of the context-arm joint metric, which could be unknown in advance.\n\nArm Similarity Estimation In our model, we are not given any metric or features of the arm, thus\nthe key question is whether it is still possible for an algorithm to exploit good structure amongst\nthe arms if it exists. We propose an algorithm inspired by Slivkin\u2019s contextual zooming algorithm,\nwhich also adaptively partitions the context-arm space with the goal to allow for coarse estimates that\nconverge quickly initially, and subsequently selectively re\ufb01ne the partition and the corresponding\nestimates in regions of the context-arm space that are nearly optimal. The key challenge to deal with\nis determining how to subpartition amongst the arms when we do not know any underlying metric or\nfeature space. 
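One concrete way to quantify data-driven arm similarity is a finite-sum L2 distance between plug-in estimates of two reward functions, with a correction for estimation noise. The sketch below is illustrative only: the function name, the clipping at zero, and the bias-correction form are our assumptions, not the paper's exact definition.

```python
import numpy as np

def hat_D(fa_hat, fb_hat, u, v, sigma2, k, n=200):
    """Finite-sum approximation to the L2 distance between two estimated
    reward functions on [u, v], minus a 2*sigma^2/k term to offset the
    variance of k-sample averages (illustrative sketch)."""
    zs = [(1.0 - i / n) * u + (i / n) * v for i in range(1, n + 1)]
    s = np.mean([(fa_hat(z) - fb_hat(z)) ** 2 for z in zs]) - 2.0 * sigma2 / k
    return np.sqrt(max(s, 0.0))  # clip in case the noise correction overshoots
```

Identical (noise-free) estimates yield distance zero, and well-separated functions yield a strictly positive distance, which is what a clustering step needs.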
Our algorithm estimates a similarity (or distance) from the collected data itself, and uses the data-dependent distances to cluster/subpartition amongst the arms. This concept is similar to clustering bandits, which also learn data-driven similarities, except that the clustering bandit works assume linear reward functions and finite types, whereas our model and algorithm are more general, handling nonparametric functions and arms drawn from an underlying continuous space [6].\nWe want our algorithm to partition the context-arm product space into balls, or subsets, within which the maximum diameter is bounded, where the diameter of a subset is defined as diam(S) := sup_{(x,a)\u2208S} fa(x) \u2212 inf_{(x\u2032,a\u2032)\u2208S} fa\u2032(x\u2032). We consider balls \u03c1 \u2286 [0, 1] \u00d7 [K] which have the form [c0(\u03c1), c1(\u03c1)] \u00d7 A(\u03c1), where c0(\u03c1) \u2208 [0, 1] denotes the start of the context interval, c1(\u03c1) \u2208 [c0(\u03c1), 1] denotes the end of the context interval, and A(\u03c1) \u2286 [K] denotes the subset of arms. We use \u2206(\u03c1) := c1(\u03c1) \u2212 c0(\u03c1) to denote the \u201cwidth\u201d of the context interval pertaining to the ball \u03c1.\nIn order to figure out which set of arms to include in a \u201cball\u201d such that the diameter is bounded, we ideally would like to measure the L\u221e distance with respect to the context interval of the ball, max_{x\u2208[c0(\u03c1),c1(\u03c1)]} |fa(x) \u2212 fa\u2032(x)|. As the functions are assumed to be Lipschitz with respect to the context space, a bound on the L2 distance also implies a bound on the L\u221e distance. Our algorithm approximates the L2 distance, defined with respect to an interval [u, v] according to\n\nD_u^v(a, a\u2032) := \\sqrt{ (1/200) \\sum_{i\u2208[200]} ( fa(zi(u, v)) \u2212 fa\u2032(zi(u, v)) )^2 }, (2)\n\nwhere zi(u, v) = (1 \u2212 i/200) u + (i/200) v.
This is a finite sum approximation to the integrated L2 distance between fa and fa\u2032 within the interval [u, v].\nOur algorithm uses the data collected for an arm in order to approximate the reward functions using a k nearest neighbor estimator, and subsequently uses the estimated reward functions to approximate D_u^v. These approximate distances are then used to cluster the arms when subpartitioning. With high probability, we show that the diameter of the constructed balls is bounded by 2L\u2206(\u03c1). Our algorithm collects extra samples to compute these distances, and a key part of the analysis is to understand when the improvement in the learning rate of the reward functions is sufficient to offset the cost of estimating arm distances.\n\n4 Algorithm Statement\n\nLet nt(\u03c1) := \\sum_{s=1}^{t\u22121} I(\u03c1s = \u03c1) denote the number of times \u03c1 has been selected before trial t. Let \u00b5t(\u03c1) := (1/nt(\u03c1)) \\sum_{s=1}^{t\u22121} I(\u03c1s = \u03c1) \u03c0s denote the average observed reward from \u03c1 before trial t. Define\n\nUCBt(\u03c1) = \u00b5t(\u03c1) + 2L\u2206(\u03c1) + \\sqrt{6\u03c32 ln(T )/nt(\u03c1)}, (3)\n\nwhich gives an upper confidence bound for the maximum reward achievable by any context-arm pair in the ball \u03c1. The algorithm maintains two sets of balls, P and P\u2217, such that P \u222a P\u2217 is a partition of the context-arm space, i.e. all balls are disjoint and their union covers the entire space. We refer to balls in P\u2217 as flagged. They are given ultimate priority in the algorithm, until sufficient samples are collected to further subpartition the ball via clustering. 
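The index in Eq. (3) can be sketched in a few lines; the `Ball` bookkeeping below is our own minimal illustration, not the paper's pseudocode:

```python
import math

class Ball:
    """A ball rho = [c0, c1] x A: a context interval plus a subset of arms."""
    def __init__(self, c0, c1, arms):
        self.c0, self.c1, self.arms = c0, c1, arms
        self.n = 0        # n_t(rho): number of times the ball was selected
        self.total = 0.0  # running sum of observed rewards from the ball

    def ucb(self, L, sigma2, T):
        """UCB_t(rho) = mu_t(rho) + 2*L*Delta(rho) + sqrt(6*sigma^2*ln(T)/n_t(rho))."""
        if self.n == 0:
            return float("inf")      # unexplored balls get top priority
        width = self.c1 - self.c0    # Delta(rho)
        return (self.total / self.n + 2.0 * L * width
                + math.sqrt(6.0 * sigma2 * math.log(T) / self.n))
```

A narrow ball with high average reward dominates a wide ball with low average reward, which is exactly the bias (diameter) vs. variance (confidence radius) trade-off described above.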
We refer to balls in P as \u201cactive\u201d, within\nwhich priority is given to balls with higher upper con\ufb01dence bound (UCB).\n\nBall-Arm Selection Rule\nIn a given trial t, when the context xt arrives, the algorithm identi\ufb01es the\n\ufb02agged balls \u03c1 \u2208 P\u2217 which contain context xt, i.e. xt \u2208 [c0(\u03c1), c1(\u03c1)], and gives priority amongst\nthem to balls with larger width \u2206(\u03c1),\n\n\u03c1t = argmax\u03c1\u2208P\u2217 \u2206(\u03c1)I (xt \u2208 [c0(\u03c1), c1(\u03c1)]) .\n\nIf there are no \ufb02agged balls in P\u2217 which contain xt, then the algorithm selects an active ball \u03c1 \u2208 P\ncontaining xt, and gives priority to the ball with the highest upper con\ufb01dence bound U CBt(\u03c1),\n\n\u03c1t = argmax\u03c1\u2208P U CBt(\u03c1)I (xt \u2208 [c0(\u03c1), c1(\u03c1)]) .\n\n(4)\nWhen a ball \u03c1t is chosen, the algorithm plays an arm at \u2208 A\u03c1t via a round robin ordering. The\nalgorithm observes a noisy reward \u03c0t for arm at and updates nt(\u03c1), \u00b5t(\u03c1), and U CBt(\u03c1) accordingly.\nBy grouping the context-arm pairs into balls, the algorithm aggregates the observed rewards within a\nball to trade-off between bias and variance. For any given trial, the algorithm reduces the decision\nproblem from selecting amongst a large number of arms to selecting amongst a smaller set of balls,\nwhich each consist of a subset of arms. Whenever the ball is subpartitioned, the width of the context\ninterval is halved, such that balls never repeat, and are always strictly nested within a hierarchy.\nFurthermore, the fact that the algorithm gives priority to \ufb02agged balls with larger context widths\nimplies that the data collected in the \u201c\ufb02agged\u201d phase of every ball will be uniformly distributed over\ncontext width of that ball.\n\nFlagging Rule At the beginning of the algorithm, the entire context-arm space is \ufb02agged as a single\nlarge ball to be subpartitioned, i.e. 
P\u2217 = {[0, 1] \u00d7 [K]} and P = \u2205. In subsequent rounds, we flag a ball \u03c1 \u2208 P whenever it satisfies the condition nt(\u03c1) > 6\u03c32 ln(T )/(L2\u22062(\u03c1)). Upon being flagged, \u03c1 is removed from P and added to P\u2217. Let the stopping time \u03c4f (\u03c1) denote the trial t at which ball \u03c1 is flagged. Intuitively, the threshold is chosen at the point where the confidence radius, i.e. the natural variation in the estimates due to the additive Gaussian observation error, is on the order of the diameter of the ball. As a result, collecting further samples does not improve the overall UCB because the diameter of the ball will dominate the expression.\n\nSub-Partitioning via Clustering Recall that flagged balls in P\u2217 are always given priority over active balls in P. The observations collected in the flagged phase are used to estimate distances, or similarities, between the arms for the purpose of subpartitioning the ball into smaller balls. In particular, the algorithm splits the context space [c0(\u03c1), c1(\u03c1)] into 64 evenly sized intervals and waits until it collects at least k samples within each of the 64 intervals for each of the arms a \u2208 A(\u03c1), where k is chosen according to k = 5431\u03c32 ln(T|A(\u03c1)|)/(L2\u22062(\u03c1)). This condition is mathematically stated as \\prod_{a\u2208A(\u03c1)} SUFFDATA(a) = 1, where\n\nSUFFDATA(a) := \\prod_{i=1}^{64} I( \\sum_{s>\u03c4f(\u03c1)} I(\u03c1s = \u03c1, as = a) I(xs \u2208 [wi\u22121, wi]) \u2265 k ),\n\nfor wi = c0(\u03c1) + i\u2206(\u03c1)/64. When this sufficient data condition is satisfied, the algorithm uses the observations collected in the flagged phase to compute pairwise arm distances approximating (2). 
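The sufficient-data check can be sketched as follows; the number of sub-intervals is left as a parameter so the example stays small (the paper fixes it at 64), and the sample representation is our own assumption:

```python
def suff_data(samples, c0, c1, arms, k, bins=64):
    """Check the sufficient-data condition: every arm in `arms` has at least k
    post-flagging samples in each of the `bins` even sub-intervals of [c0, c1].
    `samples` is a list of (arm, context) pairs collected in the flagged phase."""
    width = (c1 - c0) / bins
    for a in arms:
        xs = [x for (arm, x) in samples if arm == a]
        for i in range(bins):
            lo, hi = c0 + i * width, c0 + (i + 1) * width
            if sum(1 for x in xs if lo <= x <= hi) < k:
                return False
    return True
```

Because flagged balls receive priority and contexts are uniform, each arm's samples spread evenly over the interval, so this condition is eventually met.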
Let\n\u03c4cl(\u03c1) denote the trial in which the sufficient data condition is satisfied and \u03c1 is subpartitioned.\nThe SUBPARTITION subroutine estimates the reward functions via a k-nearest neighbor estimator,\n\n\u02c6fa(x) = (1/k) \\sum_{s=\u03c4f(\u03c1)+1}^{\u03c4cl(\u03c1)} I(\u03c1s = \u03c1, as = a) I(xs \u2208 k-NN(x)) \u03c0s, (5)\n\nwhere xs is a k nearest neighbor of x if \\sum_{\u2113=\u03c4f(\u03c1)+1}^{\u03c4cl(\u03c1)} I(\u03c1\u2113 = \u03c1, a\u2113 = a) I(|x\u2113 \u2212 x| \u2264 |xs \u2212 x|) \u2264 k.\nGiven the estimated functions {\u02c6fa}a\u2208A(\u03c1) and a pair of arms a, a\u2032 \u2208 A(\u03c1), we compute \u02c6D_u^v(a, a\u2032) for the intervals [u, v] = [c0(\u03c1), (c0(\u03c1) + c1(\u03c1))/2] and [u, v] = [(c0(\u03c1) + c1(\u03c1))/2, c1(\u03c1)] according to\n\n\u02c6D_u^v(a, a\u2032) := \\sqrt{ (1/200) \\sum_{i\u2208[200]} ( \u02c6fa(zi(u, v)) \u2212 \u02c6fa\u2032(zi(u, v)) )^2 \u2212 2\u03c32/k }, (6)\n\nwhere zi(u, v) = (1 \u2212 i/200) u + (i/200) v and the term 2\u03c32/k accounts for the bias due to the noise.\nWe use the computed distances \u02c6D_u^v(a, a\u2032) to subpartition \u03c1 by clustering the arms for each half of the context interval separately. For an arbitrary ordering of the arms, we test whether the next arm has distance less than 3L(v \u2212 u)/16 to any of the existing cluster centers. If so, we assign it to the cluster associated to the closest cluster center. Otherwise, we create a new cluster and assign this arm to be the cluster center. This results in a clustering in which all pairs of cluster centers are guaranteed to be at least distance 3L(v \u2212 u)/16 apart, and all members of a cluster must be within distance 3L(v \u2212 u)/16 of the cluster center. These distances are measured with respect to the data dependent estimates \u02c6D_u^v(a, a\u2032). 
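The greedy one-pass clustering just described can be sketched as follows; `dist` stands in for the data-dependent estimate and `threshold` for 3L(v-u)/16 (function and parameter names are ours):

```python
def cluster_arms(arms, dist, threshold):
    """Greedy clustering: scan arms in order; join the closest existing center
    within `threshold`, otherwise open a new cluster with this arm as center."""
    centers, clusters = [], []
    for a in arms:
        close = [(dist(a, c), j) for j, c in enumerate(centers)
                 if dist(a, c) < threshold]
        if close:
            _, j = min(close)       # attach to the closest qualifying center
            clusters[j].append(a)
        else:
            centers.append(a)       # a becomes a new cluster center
            clusters.append([a])
    return clusters
```

By construction, centers are pairwise at least `threshold` apart and every member is within `threshold` of its center, matching the guarantee stated above.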
In our analysis, we show that with high probability \u02c6D_u^v(a, a\u2032) \u2248 D_u^v(a, a\u2032).\nOnce the clusters are created, \u03c1 is unflagged (removed from P\u2217) and new balls corresponding to each of the clusters, for each half of the context interval, are added to the active set P. See the appendix for a pseudocode description of the algorithm.\n\n5 Simulation\n\nWe test our algorithm on a model with 50, 100, and 200 arms and a context space of [0, 1]. Each arm a corresponds to a parameter \u03b8a, uniformly spaced within [0, 1]. The expected reward for arm a and context x is\n\nfa(x) := g(x, \u03b8a) = 1 \u2212 |x \u2212 4 min_{z\u2208{0,0.5,1}} |\u03b8a \u2212 z||.\n\nThis function is periodic with respect to \u03b8, and can be depicted as a zigzag. Our distance estimate \u02c6D_u^v(a, a\u2032) approximates D_u^v(a, a\u2032), which is defined with respect to fa and fa\u2032 directly and does not depend on \u03b8a. Consider a measure preserving transformation that maps \u03b8a to \u03c6a = 4 min_{z\u2208{0,0.5,1}} |\u03b8a \u2212 z|, such that the reward function is equivalently described by fa(x) = 1 \u2212 |x \u2212 \u03c6a|. 
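The simulated reward surface takes only a few lines to reproduce; this sketch of g(x, \u03b8a) uses exact binary fractions in the checks so the equalities hold without floating-point tolerance:

```python
def f(x, theta):
    """Zigzag reward g(x, theta): theta folds onto phi = 4 * (min distance to
    {0, 0.5, 1}), so arms with different theta can share an identical curve."""
    phi = 4.0 * min(abs(theta - z) for z in (0.0, 0.5, 1.0))
    return 1.0 - abs(x - phi)
```

For example, theta = 0.125 and theta = 0.875 both fold to phi = 0.5 and hence define the same reward function, which is exactly the latent structure the algorithm is meant to discover.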
An algorithm which partitions with respect to D_u^v(a, a\u2032) would be agnostic to such a transformation, as opposed to an algorithm which depends on a metric defined with respect to the arm\u2019s representation, which would perform worse on \u03b8a than on \u03c6a.\nWe benchmark the performance of our Approx-Zooming algorithm against three variations:\n\u2022 Approx-Zooming-With-True-Reward-Function: We give the Approx-Zooming algorithm oracle access to evaluate D_u^v(a, a\u2032) at no cost, which is used to subpartition whenever a ball is flagged.\n\u2022 Approx-Zooming-With-Similarity-Metric: We give the Approx-Zooming algorithm oracle access to evaluate |\u03b8a \u2212 \u03b8a\u2032| at no cost, which is used to subpartition whenever a ball is flagged.\n\u2022 Approx-Zooming-With-No-Arm-Similarity: This naive variant uses no arm similarities, estimating each arm\u2019s reward independently. The context space is adaptively partitioned via our algorithm.\nWe chose the model parameters that led to the highest average cumulative reward for each baseline algorithm. For all algorithms the flagging rule is set to nt(\u03c1) \u2265 4 ln(T )/\u22062, and \u03c3 was set to 1e\u22122. For Approx-Zooming, k was set to 10. We set the number of trials T to 100,000, as all the algorithms had converged to their optimal point by then. Additional details on how the model parameters were chosen are given in Appendix F.\n\nIn Figure 1, we plot the average cumulative reward over the trials, i.e. (1/T) \\sum_{t=1}^{T} \u03c0t, where T is the total number of trials and \u03c0t \u2208 (0, 1) is the reward observed in the tth trial. We plot the result for the 200 arm setting with \u03c3 set to 1e\u22122. As we can see, the oracle variant of the algorithm that uses the true reward function to calculate D_u^v(a, a\u2032) achieves the best cumulative reward across the entire time horizon. 
Not surprisingly, the algorithm which learns each arm separately takes more\ntime to converge to the optimal policy compared to all the other methods. Our Approx-Zooming\nalgorithm has a heavy cost up front due to the clustering of the arms globally, but the algorithm\nimproves over the time horizon as it learns the correct arm similarities. The oracle variant which uses\nthe similarity metric |\u03b8a \u2212 \u03b8a(cid:48)| performs worse than the true Dv\nu(a, a(cid:48)) variant, as it does not account\nfor the periodic nature of the function. This supports our intuition that algorithms which depend on a\ngiven metric are sensitive to the choice of a good vs bad metric.\n\nFigure 1: Avg. cumulative reward vs. number of trials\n\nFigure 2: Approx-Zooming Selected Arm Frequency Over The Trials\n\nIn \ufb01gure 2 we plot the frequencies an arm is selected in different contexts over the T trials. Each of\nthe four plots corresponds to averaging the frequency over T /4 trials across the time horizon. The\nx-axis refers to the context space, and the y-axis refers to the set of arms. Initially the frequency plot is\nvery blurry, indicating that our algorithm is not necessarily playing the optimal arm but selecting arms\nto learn the latent arm structure. As time progresses our algorithm learns the similarities amongst\narms and gradually plays the arms using the latent structure, which is depicted by the zigzag shape\nsharpening. Finally, in the last trials Approx-Zooming plays the optimal policy, which corresponds to\nthe clear zigzag. In Appendix F we present similar plots for the benchmark algorithms.\nOur simulations show that when the number of arms is large, it is important to use similarities\namongst arms to more quickly learn the optimal policy. In addition our results highlight the fact\nthat metric-based algorithms may be sensitive to the choice of metric, which is not a trivial task. 
In contrast, our approach relies on samples from the reward distribution to learn the latent structure, and is thus agnostic to any choice of metric. However, the parameter $k$ needs to be carefully tuned for our algorithm to avoid unnecessary sampling for estimating similarities. In Appendix F we include similar plots for other parameters of the problem, in particular for a smaller number of arms. We see that for 50 arms or 100 arms, the cost due to the added extra exploration may exceed the gain from learning the metric, and thus we anticipate that the benefits of learning the metric only dominate in regimes where the number of arms is large and the time horizon is sufficiently long.

6 Upper bound on the Regret

We present a general bound on the regret expressed as a function of a quantity relating to the local geometry of the reward function near the optimal policy. Let us denote $w_i(\ell) = [(\ell-1)2^{-i}, \ell 2^{-i}]$, $\kappa(x) = f^*(x) - \max_{a \in [K]} f_a(x)\, \mathbb{I}(f_a(x) \neq f^*(x))$, and
\[
M_i = \sum_{\ell=1}^{2^i} \mathbb{I}\Big(\min_{x \in w_i(\ell)} \kappa(x) \leq 20 L 2^{-i}\Big) \sum_{a \in [K]} \mathbb{I}\big((f^*(2^{-i}\ell) - f_a(2^{-i}\ell)) \leq 22 L 2^{-i}\big).
\]
Theorem 6.1. The expected contextual regret of Approx-Zooming is bounded above by
\[
\mathbb{E}[R(T)] = O\Big(\sigma^2 L^{-2} K \ln(TK) + \min_{i_{\max} \in \mathbb{Z}^+}\Big(L T 2^{-i_{\max}} + \sum_{i=1}^{i_{\max}-1} \sigma^2 L^{-1} M_i 2^i \ln(TK)\Big)\Big).
\]

The analysis relies on showing that the instantaneous regret incurred by choosing a ball with context width $\Delta$ is bounded above by $O(L\Delta)$. The first term in the regret is due to the very first initial clustering phase. The second term $L T 2^{-i_{\max}}$ bounds the regret incurred by all balls with context width at most $2^{-i_{\max}}$. The terms in the summation bound the regret incurred by balls with context width equal to $2^{-i}$. The function $\kappa(x)$ represents the lowest regret achieved by the second-most optimal arm, which lower bounds the suboptimality gap. In alignment with our intuition from classical MAB, when the suboptimality gap is large, the algorithm is able to more quickly converge to the optimal arm at context $x$. When we bound the regret incurred by all balls with context width $2^{-i}$, we can thus remove subintervals of the context for which $\kappa(x)$ is large, as the algorithm will have already converged to the optimal arm there. This is reflected in the first indicator function within the expression $M_i$. Once restricted to context subintervals where the suboptimality gap is not too large, the expression $\sum_{a \in [K]} \mathbb{I}((f^*(2^{-i}\ell) - f_a(2^{-i}\ell)) \leq 22 L 2^{-i})$ counts the number of arms for which the suboptimality gap is at most $22 L 2^{-i}$; arms for which the suboptimality gap is larger will have already been deemed suboptimal. As the specific bounds on $M_i$ depend on the model and local geometry amongst the arms, we provide bounds for two concrete examples to give more intuition.

Finite Types Suppose that the reward functions for the $K$ arms, $\{f_a\}_{a \in [K]}$, only take $\Theta$ different values. Essentially, this implies that there are $\Theta$ different types of arms, but we don't know the arm types a priori. Within each type, the reward function is exactly the same. Let us define
\[
\mu_\kappa(z) := \mu(\{x \in [0,1] \text{ s.t. } \kappa(x) \leq z\}),
\]
where $\mu$ is the Lebesgue measure. Then we can show that $M_i \leq 2^i K \mu_\kappa(22 L 2^{-i})$. The regret is bounded by the local measure function $\mu_\kappa$. In the finite types setting, the optimal policy corresponds to partitioning the context space $[0,1]$ into a set of intervals, $S^*$, such that across each interval $I \in S^*$, the optimal policy does not change. Let us consider the setting where $\kappa(x)$ decreases linearly fast near the points where the optimal policy changes, so that for some constant $L'$, $\mu_\kappa(22 L 2^{-i}) \leq 22 |S^*| L 2^{-i}/L'$. By plugging the bound on $M_i$ into the main theorem and choosing $i_{\max} = \frac{1}{2}\log\big(L' L T / (22 \sigma^2 |S^*| K \ln(TK))\big)$, it follows that
\[
\mathbb{E}[R(T)] \leq O\Big(\sigma^2 L^{-2} K \ln(TK) + \sqrt{\sigma^2 |S^*| L L'^{-1} T K \ln(TK)}\Big). \tag{7}
\]

Lipschitz with respect to continuous arm metric space Suppose that each arm $a$ is associated to a latent feature $\theta_a \in [0,1]$, and the expected reward function $f_a(x) = g(x, \theta_a)$, where $g : [0,1] \times [0,1] \to [0,1]$ is an $L$-Lipschitz function with respect to both the contexts and the arm latent features, such that $|g(x, \theta) - g(x', \theta')| \leq L(|x - x'| + |\theta - \theta'|)$. If we assume that the arm latent features are uniformly spread out, $\{\theta_a\} = \{i/K\}_{i \in [K]}$, then
\[
M_i \leq \sum_{j \in [K]} \sum_{\ell \in [2^i]} \mathbb{I}\big((f^*(2^{-i}\ell) - g(2^{-i}\ell, \tfrac{j}{K})) \leq 22 L 2^{-i}\big), \tag{8}
\]
which is a discrete approximation to the area of the context-arm space for which the suboptimality gap is at most $22 L 2^{-i}$. We can visualize $\sum_{\ell=1}^{2^i} M_i(\ell)$ by considering the contour plot of $f^*(x) - g(x, \theta)$, and counting how many grid points $\{(2^{-i}\ell, j/K)\}_{\ell \in [2^i], j \in [K]}$ are lower than $22 L 2^{-i}$. For large $i$ and $K$, this is approximately $2^i K \mu(\{(x, \theta) : g(x, \theta) - f^*(x) \geq -22 L 2^{-i}\})$, where $\mu$ is the Lebesgue measure. The curve at the lowest level of the contour plot corresponds to the set $\{(x, \theta) \text{ s.t. } g(x, \theta) - f^*(x) = 0\}$, which contains for each context the set of arm features that optimize the expected reward. The final regret depends on the local measure of the joint reward function.
As an example, if we consider the reward function $g(x, \theta) = 1 - L|x - \theta|$ for some $L \in (0,1)$, we can show that $M_i \leq 44K$, i.e. it is bounded by a constant with respect to $i$. Therefore, by plugging into the main theorem and choosing $i_{\max} = \frac{1}{2}\log\big(20 L^2 T / (\sigma^2 K \ln(TK))\big)$, it follows that
\[
\mathbb{E}[R(T)] \leq O\Big(\sigma^2 L^{-2} K \ln(TK) + \sqrt{\sigma^2 K T \ln(TK)}\Big). \tag{9}
\]

7 Discussion

Interpreting the results. We began this paper with the question: can an algorithm exploit hidden structure in a nonparametric contextual bandit problem with no a priori knowledge of the underlying metric? The results of our simulations suggest that our proposed algorithm (with empirically tuned hyperparameters) can perform better than the corresponding algorithm that learns over each arm separately, or that uses a suboptimal metric. However, the regret bounds we present are not sufficiently strong to provably show that the algorithm outperforms learning on arms separately.
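As an aside, the constant-$M_i$ claim for the Section 6 example $g(x, \theta) = 1 - L|x - \theta|$ can be checked numerically. The following is our own brute-force sketch of $M_i$ in its discrete (eq. (8)) form, with arbitrary parameter values; it is not the paper's code.

```python
# Brute-force sanity check of the Section 6 example g(x, theta) = 1 - L*|x - theta|
# with arm features theta_j = j / K: M_i, the number of grid points
# (2^{-i} * l, j / K) whose suboptimality gap is at most 22 * L * 2^{-i},
# stays of order K instead of growing like the naive grid size 2^i * K.
K = 200
L = 0.5
thetas = [j / K for j in range(1, K + 1)]

def g(x, th):
    return 1 - L * abs(x - th)

def f_star(x):
    # best achievable expected reward at context x over the K arms
    return max(g(x, th) for th in thetas)

def M(i):
    count = 0
    for l in range(1, 2 ** i + 1):
        x = l * 2.0 ** (-i)
        fs = f_star(x)
        count += sum(1 for th in thetas if fs - g(x, th) <= 22 * L * 2.0 ** (-i))
    return count

for i in (4, 6, 8):
    # bounded by a constant multiple of K, far below the naive count 2^i * K
    assert M(i) <= 50 * K
```

The constant 50 here is a slightly loose illustrative bound chosen to absorb grid-discretization effects.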
The stated upper bound on regret in [7] is linear in the number of arms $K$; however, this may simply be due to the fact that they did not optimize with respect to $K$ in their analysis. Our regret bound is most comparable to the regret for the infinite arm setting presented in [22], and it can be recovered from their bound by imposing the discrete metric amongst the arms.
Generalizing to higher context dimension. For simplicity, we have stated our algorithm and analysis for the 1D context space, but the results extend to the general $d$-dimensional setting. The only change required algorithmically is in the subpartitioning/clustering step. Let us define $C_d(q)$ to be the number of balls of radius $r/q$ needed to cover a ball of radius $r$, which scales exponentially in the dimension $d$, e.g. $q^d$. Since we are now estimating the reward function $f$ over a $d$-dimensional context space, the number of sub-regions of the context space that need to be clustered will be $C_d(2)$, and the number of samples needed to guarantee that the $k$-nearest neighbor samples are within distance $\frac{1}{16}$ radius will be equal to $\tilde{O}(k C_d(32))$. To compute $\hat{D}$, we will instead have a $d$-dimensional summation over the subset of the context space. Once $\hat{D}$ is computed, the clustering of arms has the same computational cost, i.e. linear in the number of arms to be clustered.
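To make the covering numbers concrete, here is a small sketch (our own illustration, using the sup-norm for simplicity; the function names are ours) confirming that $C_d(q) = q^d$ balls suffice to cover the unit cube, a sup-norm ball of radius $1/2$:

```python
import itertools

def cover_centers(d, q):
    # q^d centers of sup-norm balls of radius 1/(2q) covering [0, 1]^d
    pts = [(2 * i + 1) / (2 * q) for i in range(q)]
    return list(itertools.product(pts, repeat=d))

def covered(x, centers, radius):
    # is point x within sup-norm distance `radius` of some center?
    return any(max(abs(xi - ci) for xi, ci in zip(x, c)) <= radius + 1e-12
               for c in centers)

d, q = 3, 4
centers = cover_centers(d, q)
assert len(centers) == q ** d  # C_d(q) = q^d in the sup-norm
# every point of a fine grid in [0, 1]^d lies within radius (1/2)/q of a center
grid = [i / 10 for i in range(11)]
assert all(covered(x, centers, 0.5 / q) for x in itertools.product(grid, repeat=d))
```

The exponential growth of $q^d$ is exactly the dependence on $d$ that enters the regret bound through $C_d(2)$ and $C_d(32)$.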
The analysis can be modified to account for the $d$-dimensional setting, and the final regret bound will look like
\[
O\Big(C_d(2) C_d(32) \sigma^2 L^{-2} K \ln(TK) + \min_{i_{\max} \in \mathbb{Z}^+}\Big(L T 2^{-i_{\max}} + \sum_{i=1}^{i_{\max}-1} C_d(2) C_d(32) \sigma^2 L^{-1} M_i 2^i \ln(TK)\Big)\Big),
\]
where $M_i$ instead sums over an $\epsilon$-net of the context space for $\epsilon = 2^{-i}$, and thus we may expect $M_i$ to grow exponentially in $i \times d$, depending on the distribution of the reward function and the finite arms. The growth of $M_i$ will dominate the regret bound with respect to the dependence on the dimension $d$.
Choice of metric. Nature could apply a measure-preserving transformation to the arm metric space such that the joint function has a significantly higher Lipschitz constant. This representation would incur a worse performance by the previous Zooming algorithm, indicating that the algorithm is critically dependent on the choice of representation and metric. As an example, suppose that arms are each associated to some latent parameter $\theta \in (0, 2\pi)$, and the reward function associated to an arm $a$ is $f_a(x) = x + \sin(L\theta_a)$. The Lipschitz constant with respect to $\theta$ is $L$. By applying a change of variables from $\theta$ to $t(\theta) = L\theta \bmod 2\pi$, the associated reward function in terms of the representation $t_a = t(\theta_a)$ would be $f_a(x) = x + \sin(t_a)$, which only has Lipschitz constant 1 with respect to the reparametrization $t$. This is only a simple example amongst many that illustrate the importance of the choice of metric for learning. In contrast, our algorithm estimates similarity amongst the arms directly from data collected from the reward functions, which essentially estimates distance in the function space; as a result our algorithm is invariant to any specific covariate representation.
Future Work.
The current results are stated only for Lipschitz reward functions, where the tuning parameters depend on the Lipschitz constant. It may be interesting to generalize the algorithm to Hölder continuous reward functions, and to consider how to adapt the algorithm to the smoothness parameters if unknown. It would also be interesting to explore the connections to Gaussian process bandits. One would need to specify the covariance matrix amongst arms, and it may be possible to consider empirically estimating the covariance matrix in the process of learning.

Acknowledgement

This work is supported by the National University of Singapore and A*STAR - SERC PSF Grant 1521200084. We thank Professor David S. Rosenblum for his support of this project through insightful discussions and feedback.

References

[1] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[2] H. Bastani and M. Bayati. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.

[3] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandit algorithms with supervised learning guarantees. arXiv preprint arXiv:1002.4058, 2010.

[4] A. A. Deshmukh, U. Dogan, and C. Scott. Multi-task learning for contextual bandits. In Advances in Neural Information Processing Systems, pages 4848–4856, 2017.

[5] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.

[6] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In International Conference on Machine Learning, pages 757–765, 2014.

[7] M. Y. Guan and H. Jiang. Nonparametric stochastic contextual bandits. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[8] E. Hazan and N. Megiddo.
Online learning with prior knowledge. In International Conference on Computational Learning Theory, pages 499–513. Springer, 2007.

[9] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM, 2008.

[10] R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697–704, 2005.

[11] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

[12] A. Krishnamurthy, J. Langford, A. Slivkins, and C. Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. arXiv preprint arXiv:1902.01520, 2019.

[13] S. Lale, K. Azizzadenesheli, A. Anandkumar, and B. Hassibi. Stochastic linear bandits with hidden low rank structure. CoRR, abs/1901.09490, 2019. URL http://arxiv.org/abs/1901.09490.

[14] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, pages 817–824, 2008.

[15] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.

[16] S. Li, A. Karatzoglou, and C. Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 539–548. ACM, 2016.

[17] T. Lu, D. Pál, and M. Pál. Showing relevant ads via context multi-armed bandits. Technical report, 2009.

[18] V. Perchet, P. Rigollet, et al. The multi-armed bandit problem with covariates.
The Annals of Statistics, 41(2):693–721, 2013.

[19] W. Qian and Y. Yang. Kernel estimation and model combination in a bandit problem with covariates. The Journal of Machine Learning Research, 17(1):5181–5217, 2016.

[20] P. Rigollet and A. Zeevi. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630, 2010.

[21] A. Slivkins. Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems, pages 1602–1610, 2011.

[22] A. Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533–2568, 2014.

[23] Q. Wu, H. Wang, Q. Gu, and H. Wang. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 529–538. ACM, 2016.

[24] Y. Yang, D. Zhu, et al. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics, 30(1):100–121, 2002.