{"title": "Online Learning via the Differential Privacy Lens", "book": "Advances in Neural Information Processing Systems", "page_first": 8894, "page_last": 8904, "abstract": "In this paper, we use differential privacy as a lens to examine online learning in both full and partial information settings. The differential privacy framework is, at heart, less about privacy and more about algorithmic stability, and thus has found application in domains well beyond those where information security is central. Here we develop an algorithmic property called one-step differential stability which facilitates a more refined regret analysis for online learning methods. We show that tools from the differential privacy literature can yield regret bounds for many interesting online learning problems including online convex optimization and online linear optimization. Our stability notion is particularly well-suited for deriving first-order regret bounds for follow-the-perturbed-leader algorithms, something that all previous analyses have struggled to achieve. We also generalize the standard max-divergence to obtain a broader class called Tsallis max-divergences. These define stronger notions of stability that are useful in deriving bounds in partial information settings such as multi-armed bandits and bandits with experts.", "full_text": "Online Learning via the Differential Privacy Lens\n\nJacob Abernethy\u2217\nCollege of Computing\n\nGeorgia Institute of Technology\n\nprof@gatech.edu\n\nYoung Hun Jung\u2217\n\nDepartment of Statistics\nUniversity of Michigan\nyhjung@umich.edu\n\nChansoo Lee\u2217\nGoogle Brain\n\nchansoo@google.com\n\nAudra McMillan\u2217\n\nSimons Inst. 
for the Theory of Computing\nDepartment of Computer Science, Boston University\nKhoury College of Computer Sciences, Northeastern University\naudramarymcmillan@gmail.com\n\nAmbuj Tewari∗\nDepartment of Statistics\nDepartment of EECS\nUniversity of Michigan\ntewaria@umich.edu\n\nAbstract\n\nIn this paper, we use differential privacy as a lens to examine online learning in both full and partial information settings. The differential privacy framework is, at heart, less about privacy and more about algorithmic stability, and thus has found application in domains well beyond those where information security is central. Here we develop an algorithmic property called one-step differential stability which facilitates a more refined regret analysis for online learning methods. We show that tools from the differential privacy literature can yield regret bounds for many interesting online learning problems including online convex optimization and online linear optimization. Our stability notion is particularly well-suited for deriving first-order regret bounds for follow-the-perturbed-leader algorithms, something that all previous analyses have struggled to achieve. We also generalize the standard max-divergence to obtain a broader class called Tsallis max-divergences. These define stronger notions of stability that are useful in deriving bounds in partial information settings such as multi-armed bandits and bandits with experts.\n\n1 Introduction\n\nStability of output in the presence of small changes to the input is a desirable feature of methods in statistics and machine learning [11, 19, 31, 42]. Another area of research for which stability is a core component is differential privacy (DP). As Dwork and Roth [15] observed, “differential privacy is enabled by stability and ensures stability.” They argue that the “differential privacy lens” offers a fresh perspective from which to examine areas other than privacy. 
For example, the DP lens has been used successfully in designing coalition-proof mechanisms [25] and in preventing false discovery in statistical analysis [10, 13, 17, 28].\nIn this paper, we use the DP lens to design and analyze randomized online learning algorithms for a variety of canonical online learning problems. The DP lens allows us to analyze a broad class of online learning problems, spanning both full information (online convex optimization (OCO), online linear optimization (OLO), the experts problem) and partial information settings (multi-armed bandits, bandits with experts), using a unified framework. We are able to analyze follow-the-perturbed-leader (FTPL) as well as follow-the-regularized-leader (FTRL) based algorithms, resulting in both zero-order and first-order regret bounds; see Section 2 for definitions of these bounds.\n\n∗Author order is alphabetical, denoting equal contributions.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever, our techniques are especially well-suited to proving first-order bounds for perturbation-based methods. Historically, the understanding of regularization-based algorithms has been more advanced thanks to connections with ideas from optimization such as duality and Bregman divergences. There has been recent work [1] on developing a general framework to analyze FTPL algorithms, but it only yields zero-order bounds. A general framework that can yield first-order bounds for FTPL algorithms has been lacking so far, and we believe that the framework outlined in this paper may fill this gap in the literature. Our rich set of examples suggests that our framework will be useful in translating results from the DP literature to study a much larger variety of online learning problems in the future. 
This means that we immediately benefit from research advances in the DP community.\nWe emphasize that our aim is not to design low-regret algorithms that satisfy the privacy condition; there is already substantial existing work along these lines [4, 21, 39, 40]. Our goal is instead to show that, in and of itself, a DP-inspired stability-based methodology is quite well-suited to designing online learning algorithms with excellent guarantees. In fact, there are theoretical reasons to believe this should be possible. Alon et al. [7] have shown, via non-constructive arguments, that if a class of functions is privately learnable, then it has finite Littlestone dimension (a parameter that characterizes learnability for online binary classification). Our results can be interpreted as proving analogous claims in a constructive fashion, albeit for different, more tractable online learning problems.\nIn many of our problem settings, we are able to show new algorithms that achieve optimal or near-optimal regret. Although many of these regret bounds have already appeared in the literature, we note that they were previously possible only via specialized algorithms and analyses. In some cases (such as OLO), the regret bound itself is new. Our main technical contributions are as follows:\n• We define one-step differential stability (Definitions 2.1 and 2.2) and derive a key lemma showing how it can yield first-order regret bounds (Lemma 3.1).\n• New algorithms with first-order bounds for both OCO (Theorem 3.2) and OLO problems (Corollary 3.3) based on the objective perturbation method from the DP literature [23]. The OLO first-order bound is the first of its kind to the best of our knowledge.\n• We introduce a novel family of Tsallis γ-max-divergences (see (2)) as a way to ensure tighter stability as compared to the standard max-divergence. Having tighter control on stability is crucial in partial information settings, where loss estimates can take large values.\n• We provide optimal first-order bounds for the experts problem via new FTPL algorithms using a variety of perturbations (Theorem 3.6).\n• Our unified analysis of multi-armed bandit algorithms (Theorem 4.2) not only unifies the treatment of a large number of perturbations and regularizers that have been used in the past but also reveals the exact type of differential stability induced by them.\n• New perturbation-based algorithms for the bandits with experts problem that achieve the same zero-order and first-order bounds (Theorem 4.3) as the celebrated EXP4 algorithm [9].\n\n2 Preliminaries\n\nThe ℓ_∞, ℓ₂, and ℓ₁ norms are denoted by ‖·‖_∞, ‖·‖₂, and ‖·‖₁, respectively. The vector e_i denotes the i-th standard basis vector. The norm of a set X is defined as ‖X‖ = sup_{x∈X} ‖x‖. A sequence (a₁, ..., a_t) is abbreviated a_{1:t}, and the set {1, ..., N} is abbreviated [N]. For a symmetric matrix S, λ_max(S) denotes its largest eigenvalue. The probability simplex in R^N is denoted by Δ^{N−1}. Full versions of omitted or sketched proofs can be found in the appendix.\n\n2.1 Online learning\n\nWe adopt the common perspective of viewing online learning as a repeated game between a learner and an adversary. We consider an oblivious adversary that chooses a sequence of loss functions ℓ_t ∈ Y before the game begins. At every round t, the learner chooses a move x_t ∈ X and suffers loss ℓ_t(x_t). The action spaces X and Y characterize the online learning problem. For example, in multi-armed bandits, X = [N] for some N and Y = [0, 1]^N. 
Note that the learner is allowed to access a private source of randomness when selecting x_t. The learner’s goal is to minimize the expected regret after T rounds:\n\nE[Regret_T] = E[∑_{t=1}^T ℓ_t(x_t)] − min_{x∈X} ∑_{t=1}^T ℓ_t(x),\n\nwhere the expectations are over all of the learner’s randomness, and we recall that here the ℓ_t’s are non-random. The minimax regret is given by min_A max_{ℓ_{1:T}} E[Regret_T], where A ranges over all learning algorithms. If an algorithm achieves expected regret within a constant factor of the minimax regret, we call it minimax optimal (or simply optimal). If the factor involved is not constant but logarithmic in T and other relevant problem parameters, we call the algorithm minimax near-optimal (or simply near-optimal).\nIn the loss/gain setting, losses can be positive or negative. In the loss-only setting, losses are always non-negative: min_{x∈X} ℓ_t(x) ≥ 0 for all t. Zero-order regret bounds involve T, the total duration of the game. In the loss-only setting, a natural notion of the hardness of the adversary’s sequence is the cumulative loss of the best action in hindsight, L*_T = min_{x∈X} ∑_{t=1}^T ℓ_t(x). Note that L*_T is uniquely defined even though the best action in hindsight (denoted x*_T) may not be unique. Bounds that depend on L*_T instead of T adapt to the hardness of the actual losses encountered and are called first-order regret bounds. We will ignore factors logarithmic in T when calling a bound first-order.\nThere are some special cases of online learning that arise frequently enough to have received names. In online convex (resp. linear) optimization, the functions ℓ_t are convex (resp. linear) and the learner’s action set X is a subset of some Euclidean space. 
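As a concrete illustration of the protocol above (ours, not from the paper), the expected regret of a randomized learner against an oblivious adversary can be computed directly for small instances; the three-action loss sequence and the uniform learner below are hypothetical:

```python
def expected_regret(play_dist, losses):
    """Expected regret of a randomized learner: losses[t][i] is the loss
    of action i at round t, fixed in advance by an oblivious adversary."""
    n = len(losses[0])
    expected_loss = 0.0
    for t, loss_t in enumerate(losses):
        p = play_dist(losses[:t])  # distribution over actions, from past losses only
        expected_loss += sum(p[i] * loss_t[i] for i in range(n))
    best_in_hindsight = min(sum(l[i] for l in losses) for i in range(n))  # L*_T
    return expected_loss - best_in_hindsight

# A baseline learner that ignores the history and plays uniformly at random.
uniform = lambda past: [1.0 / 3] * 3

losses = [[0.0, 1.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
regret = expected_regret(uniform, losses)  # expected loss 2.0 vs. L*_T = 1.0
```

Replacing `uniform` with any map from loss histories to distributions gives the expected regret of that algorithm on the same sequence.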
In the linear setting, we identify a linear function with its vector representation and write ℓ_t(x) = ⟨ℓ_t, x⟩. In the experts problem, we have X = [N] and we use i_t instead of x_t to denote the learner’s moves. Also we write ℓ_t(i) = ℓ_{t,i} for i ∈ [N].\nWe consider both full and partial information settings. In the full information setting, the learner observes the loss function ℓ_t at the end of each round. In the partial information setting, the learner receives less feedback. A common form of partial information feedback is bandit feedback, i.e., the learner only observes its own loss ℓ_t(x_t). Because less information is available to the learner, deriving regret bounds, especially first-order bounds, is more challenging in partial information settings.\nNote that, in settings where the losses are linear, we will often use L_t = ∑_{s=1}^t ℓ_s to denote the cumulative loss vector. In these settings, we caution the reader to distinguish between L_T, the final cumulative loss vector, and the scalar quantity L*_T.\n\n2.2 Stability notions motivated by differential privacy\n\nThere is a substantial literature on stability-based analysis of statistical learning algorithms. However, there is less work on identifying stability conditions that lead to low-regret online algorithms. A few papers that attempt to connect stability and online learning are, interestingly, all unpublished and only available as preprints [32, 35, 36]. To the best of our knowledge, no existing work provides a stability condition which has an explicit connection to differential privacy and which is strong enough to use in both full information and partial information settings.\nDifferential privacy (DP) was introduced to study data analysis mechanisms that do not reveal too much information about any single instance in a database. In this paper, we will use DP primarily as a stability notion [15, Sec. 
13.2]. DP uses the following divergence to quantify stability. Let P, Q be distributions over some probability space. The δ-approximate max-divergence between P and Q is\n\nD^δ_∞(P, Q) = sup_{B : P(B) > δ} log[(P(B) − δ)/Q(B)],   (1)\n\nwhere the supremum is taken over measurable sets B. When δ = 0, we drop the superscript δ. If Y and Z are random variables, then D^δ_∞(Y, Z) is defined to be D^δ_∞(P_Y, P_Z), where P_Y and P_Z are the distributions of Y and Z. We want to point out that the max-divergence is not a metric, because it is asymmetric and does not satisfy the triangle inequality.\nA randomized online learning algorithm maps the loss sequence ℓ_{1:t−1} ∈ Y^{t−1} to a distribution over X. We now define a stability notion for online learning algorithms that quantifies how much the distribution of the algorithm changes when a new loss function is seen.\nDefinition 2.1 (One-step differential stability w.r.t. a divergence). An online learning algorithm A is one-step differentially stable w.r.t. a divergence D (abbreviated DiffStable(D)) at level ε iff for any t and any ℓ_{1:t} ∈ Y^t, we have D(A(ℓ_{1:t−1}), A(ℓ_{1:t})) ≤ ε.\nRemark. The classical definition of DP [4] says a randomized algorithm A is (ε, δ)-DP if D^δ_∞(A(ℓ_{1:t}), A(ℓ′_{1:t})) ≤ ε whenever ℓ_{1:t} and ℓ′_{1:t} differ in at most one item. We can consider ℓ_{1:t−1} as an ℓ′_{1:t} by adding an uninformative loss (e.g., zero loss for every action) as the last item.\nIn the case when ℓ_t is a vector, it will be useful to define a similar notion where the stability level depends on the norm of the last loss vector.\nDefinition 2.2 (One-step differential stability w.r.t. a norm and a divergence). 
An online learning algorithm A is one-step differentially stable w.r.t. a norm ‖·‖ and a divergence D (abbreviated DiffStable(D, ‖·‖)) at level ε iff for any t and any ℓ_{1:t} ∈ Y^t, we have D(A(ℓ_{1:t−1}), A(ℓ_{1:t})) ≤ ε‖ℓ_t‖.\nAs we will discuss later, in the partial information setting, the estimated loss vectors can have very large norms. In such cases, it will be helpful to consider divergences that give tighter control compared to the max-divergence. Define a new divergence (we call it the Tsallis γ-max-divergence)\n\nD_{∞,γ}(P, Q) = sup_B [log_γ(P(B)) − log_γ(Q(B))],   (2)\n\nwhere the generalized logarithm log_γ, which we call the Tsallis γ-logarithm², is defined for x ≥ 0 as\n\nlog_γ(x) = log(x) if γ = 1, and log_γ(x) = (x^{1−γ} − 1)/(1 − γ) if γ ≠ 1.\n\nFor given P and Q, D_{∞,γ}(P, Q) is a non-decreasing function of γ for γ ∈ [1, 2] (see Appendix A). Therefore, γ > 1 gives notions stronger than standard differential privacy. We will only consider the case γ ∈ [1, 2] in this paper. For the full information setting, γ = 1 (i.e., the standard max-divergence) suffices. Higher values of γ are used only in the partial information settings. While our work shows the importance of the Tsallis max-divergences in the analysis of online learning algorithms, whether they lead to interesting notions of privacy is less clear. Along these lines, note that the definition of these divergences ensures that they enjoy the post-processing inequality under deterministic functions, just like the standard max-divergence (see Appendix A). Our generalization of the max-divergence does not rest on the use of an approximation parameter δ; hence we will either use D^δ_∞ or D_{∞,γ}, but never D^δ_{∞,γ}. 
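A minimal sketch (ours, not from the paper) of the Tsallis γ-logarithm and the γ-max-divergence for distributions on a finite sample space, where the supremum over events B can be enumerated; it also checks the monotonicity in γ stated above on one hypothetical pair of distributions:

```python
import math
from itertools import chain, combinations

def tsallis_log(x, gamma):
    """Tsallis gamma-logarithm: log(x) if gamma == 1, else (x**(1-gamma) - 1)/(1 - gamma)."""
    if gamma == 1:
        return math.log(x)
    return (x ** (1 - gamma) - 1) / (1 - gamma)

def tsallis_max_divergence(p, q, gamma=1.0):
    """D_{infinity,gamma}(P, Q) = sup_B [log_gamma(P(B)) - log_gamma(Q(B))],
    taking the supremum over all non-empty events B of a finite sample space."""
    idx = range(len(p))
    events = chain.from_iterable(combinations(idx, r) for r in range(1, len(p) + 1))
    return max(tsallis_log(sum(p[i] for i in b), gamma)
               - tsallis_log(sum(q[i] for i in b), gamma) for b in events)

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
d1 = tsallis_max_divergence(p, q, gamma=1.0)  # standard max-divergence, here log(0.5/0.4)
d2 = tsallis_max_divergence(p, q, gamma=2.0)
assert d2 >= d1  # non-decreasing in gamma, as stated above
```

For γ = 1 this reduces to the (0-approximate) max-divergence of (1); enumeration is exponential in the support size and is intended only as a sanity check.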
Note that we often omit δ and γ in cases where the former is 0 and the latter is 1.\n\n3 Full information setting\n\nIn this section, we state a key lemma connecting differential stability to first-order bounds. The lemma is then applied to obtain first-order bounds for OCO and OLO. In Section 3.3, we consider the Gradient-Based Prediction Algorithm (GBPA) for the experts problem. There we show how the Tsallis max-divergence arises in the GBPA analysis: a differentially consistent potential leads to a one-step differentially stable algorithm w.r.t. the Tsallis max-divergence (Proposition 3.5). Differential consistency was introduced as a smoothness notion for GBPA potentials by Abernethy et al. [2]. Skipped proofs appear in Appendix B.\n\n3.1 Key lemma\n\nThe following lemma is a key tool to derive first-order bounds. There are two reasons why this simple lemma is so powerful. First, it makes the substantial body of algorithmic work in DP available for the purpose of deriving regret bounds. The parameters ε, δ below then come directly from whichever algorithm from the DP literature we decide to use. Second, DP algorithms often add perturbations to achieve privacy. In that case, the fictitious algorithm A+ becomes the so-called “be-the-perturbed-leader” (BTPL) algorithm [22], whose regret is usually independent of T (but does scale with ε, δ). One can generally set δ to be very small, such as O(1/(BT)), without significantly sacrificing the stability level ε.\nIn the following lemma we consider taking an algorithm A and modifying it into a fictitious algorithm A+. This new algorithm has the benefit of one-step lookahead: at time t, A+ plays the distribution A(ℓ_{1:t}), whereas A would play A(ℓ_{1:t−1}). It is convenient to consider the regret of A+ for the purpose of analysis.\nLemma 3.1. Consider the loss-only setting with loss functions bounded by B. 
Let A be DiffStable(D^δ_∞) at level ε ≤ 1. Then we have\n\nE[Regret(A)_T] ≤ 2εL*_T + 3E[Regret(A+)_T] + δBT.\n\n²This quantity is often called the Tsallis q-logarithm; e.g., see [8, Chap. 4].\n\nAlgorithm 1 Online convex optimization using Obj-Pert by Kifer et al. [23]\n1: Parameters: privacy parameters (ε, δ), upper bound β on the norm of the loss gradient, upper bound γ on the eigenvalues of the loss Hessian, perturbation distribution (either Gamma or Gaussian)\n2: for t = 1, ..., T do\n3:   if using the Gamma distribution then\n4:     Sample b ∈ R^d from a distribution with density f(b) ∝ exp(−ε‖b‖₂/(2β))\n5:   else if using the Gaussian distribution then\n6:     Sample b ∈ R^d from the multivariate Gaussian N(0, Σ) where Σ = β²(log(2/δ) + 4ε)/ε² · I\n7:   end if\n8:   Play x_t = argmin_{x∈X} ∑_{s=1}^{t−1} ℓ_s(x) + (γ/ε)‖x‖₂² + ⟨b, x⟩\n9: end for\n\n3.2 Online optimization: convex and linear loss functions\n\nWe now consider online convex optimization, a canonical problem in online learning, and show how to apply a privacy-inspired stability argument to effortlessly convert differential privacy guarantees into online regret bounds. In the theorem below, we build on the privacy guarantees provided by Kifer et al. [23] for (batch) convex loss minimization, and therefore Algorithm 1 uses their Obj-Pert (objective perturbation) method to select moves.\nTheorem 3.2 (First-order regret in OCO). Suppose we are in the loss-only OCO setting. Let X ⊂ R^d with ‖X‖₂ ≤ D, and let all loss functions be bounded by B. 
Further assume that ‖∇ℓ_t(x)‖₂ ≤ β, λ_max(∇²ℓ_t(x)) ≤ γ, and that the Hessian matrix ∇²ℓ_t(x) has rank at most one, for every t and x ∈ X. Then, the expected regret of Algorithm 1 is at most\n\nO(√(L*_T (γD² + βdD)))  and  O(√(L*_T (γD² + D√(dβ² log(BT)))))\n\nwith Gamma and Gaussian perturbations, respectively.\n\nProof Sketch. From the DP result by Kifer et al. [23, Theorem 2], we can infer that D^δ_∞(x_t, x_{t+1}) ≤ ε, where δ becomes zero when using the Gamma distribution. This means that Algorithm 1 enjoys one-step differential stability w.r.t. D_∞ (resp. D^δ_∞) in the Gamma (resp. Gaussian) case. The regret of the fictitious A+ algorithm can be shown to be bounded by (γ/ε)D² + 2D·E‖b‖₂. Using Lemma 3.1, we can deduce that the expected regret of Algorithm 1 is at most\n\n2εL*_T + (3γ/ε)D² + 6D·E‖b‖₂ + δBT,   (3)\n\nwhere δ becomes zero when using the Gamma distribution. We have E‖b‖₂ = 2dβ/ε in the Gamma case and E‖b‖₂ ≤ √(d(β² log(2/δ) + 4ε))/ε in the Gaussian case. Plugging these results into (3) and optimizing over ε (setting δ = 1/(BT) in the Gaussian case) proves the desired bound.\n\nThe rank-one restriction on the Hessian of ℓ_t, which allows loss curvature in one direction, is a strong assumption but indeed holds in many common scenarios, e.g., ℓ_t(x) = φ_t(⟨x, z_t⟩) for some scalar loss function φ_t and vector z_t ∈ R^d. This is a common situation in online classification and online regression with linear predictors. Moreover, it seems likely that the rank restriction can be removed in the results of Kifer et al. 
[23] at the cost of a higher ε. A key strength of our approach is that we will immediately inherit any future improvements to existing privacy results. Note that first-order bounds for smooth convex functions have been shown by Srebro et al. [37]. However, their analysis relies on the self-bounding property, i.e., the norm of the loss gradient being bounded by the loss itself, which does not hold for linear functions. Even logarithmic rates in L*_T are available [29], but they rely on extra properties such as exp-concavity. When the functions are linear, ℓ_t(x) = ⟨ℓ_t, x⟩, the restrictions on the Hessian are automatically met. The gradient condition reduces to ‖ℓ_t‖₂ ≤ β, which gives us the following corollary.\nCorollary 3.3 (First-order regret in OLO). Suppose we are in the loss-only OLO setting. Let X ⊂ R^d with ‖X‖₂ ≤ D. Further assume that ‖ℓ_t‖₂ ≤ β for every t. Then, the expected regret of Algorithm 1 with no ℓ₂-regularization (i.e., γ = 0) is at most O(√(L*_T dβD)) and O(√(L*_T βD√(d log(βDT)))) with Gamma and Gaussian perturbations, respectively.\n\nAlgorithm 2 Gradient-Based Prediction Algorithm (GBPA) for the experts problem\n1: Input: concave potential Φ̃ : R^N → R with ∇Φ̃ ∈ Δ^{N−1}\n2: Set L₀ = 0 ∈ R^N\n3: for t = 1 to T do\n4:   Sampling: choose i_t ∈ [N] according to the distribution p_t = ∇Φ̃(L_{t−1}) ∈ Δ^{N−1}\n5:   Loss: incur loss ℓ_{t,i_t} and observe the entire vector ℓ_t\n6:   Update: L_t = L_{t−1} + ℓ_t\n7: end for\n\nAbernethy et al. [1] showed that FTPL with Gaussian perturbations is an algorithm applicable to general OLO problems with regret O(βD d^{1/4} √T). 
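For linear losses on a Euclidean ball (the setting of Corollary 3.3, with γ = 0), the play step of Algorithm 1 has a closed form. The sketch below is ours; in particular, sampling b from the density f(b) ∝ exp(−ε‖b‖₂/(2β)) by combining a uniform direction with a Gamma(shape d, scale 2β/ε) radius is a standard decomposition, and the parameter values are hypothetical:

```python
import math, random

random.seed(0)

def sample_objpert_noise(d, eps, beta):
    """Draw b in R^d with density proportional to exp(-eps*||b||_2/(2*beta)):
    a uniformly random direction scaled by a Gamma(shape=d, scale=2*beta/eps) radius."""
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in g))
    radius = random.gammavariate(d, 2 * beta / eps)
    return [radius * c / norm for c in g]

def play(cum_loss, eps, beta, D=1.0):
    """Play step specialized to linear losses (gamma = 0) on the Euclidean ball
    of radius D: argmin over the ball of <L_{t-1} + b, x> is -D*(L+b)/||L+b||_2."""
    b = sample_objpert_noise(len(cum_loss), eps, beta)
    v = [l + n for l, n in zip(cum_loss, b)]
    vnorm = math.sqrt(sum(c * c for c in v))
    return [-D * c / vnorm for c in v]

x = play([3.0, 1.0, -2.0], eps=0.5, beta=1.0)  # one perturbed move, on the unit sphere
```

The closed-form argmin is what makes the fictitious one-step-lookahead algorithm A+ easy to analyze in this setting.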
However, their analysis technique, based on convex duality, does not lead to first-order bounds in the loss-only setting. To the best of our knowledge, the result above provides a novel first-order bound for OLO when both the learner’s and the adversary’s sets are measured in the ℓ₂ norm (the classic FTPL analysis of Kalai and Vempala [22] for OLO uses the ℓ₁ norm). Note that L*_T can be significantly less than its maximum value βDT. We also emphasize that the bound in Corollary 3.3 depends on the dimension d, which can lead to a loose bound. There are different algorithms, such as online gradient descent or online mirror descent (e.g., see [20]), whose regret bounds are dimension-free. It remains an open question to prove such a dimension-free bound for any FTPL algorithm for OLO.\n\n3.3 Experts problem\n\nWe will now turn our attention to another classical online learning setting, prediction with expert advice [12, 18, 24]. In the experts problem, X = [N], Y = [0, 1]^N, and a randomized algorithm plays a distribution over the N experts. In the remainder of this paper, we will consider discrete sets for the player’s moves, and so we will use i_t instead of x_t to denote the learner’s move and p_t ∈ Δ^{N−1} to denote the distribution from which i_t is sampled.\nThe GBPA family of algorithms is important for the experts problem and for the related problem of adversarial bandits (discussed in the next section). It includes FTPL and FTRL algorithms as subfamilies. The main ingredient in GBPA is a potential function Φ̃ whose gradient is used to generate probability distributions over the moves. This potential function can be thought of as a smoothed version of the baseline potential Φ(L) = min_i L_i for L ∈ R^N. The baseline potential is non-smooth, and using it in GBPA would result in the follow-the-leader (FTL) algorithm, which is known to be unstable. 
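To make Algorithm 2 concrete, here is a minimal sketch (ours) with one particular smoothed potential, the entropic soft-min Φ̃(L) = −(1/η) log ∑_i exp(−ηL_i), whose gradient is the exponential-weights distribution; the learning rate η and the loss sequence are illustrative choices:

```python
import math

def softmin_gradient(L, eta):
    """Gradient of the entropic smoothed potential
    Phi(L) = -(1/eta) * log(sum_i exp(-eta * L_i)); it lies in the simplex."""
    m = min(L)  # shift for numerical stability
    w = [math.exp(-eta * (l - m)) for l in L]
    s = sum(w)
    return [v / s for v in w]

def gbpa_expected_loss(losses, eta):
    """Run Algorithm 2 with the soft-min potential; return the total expected loss."""
    L = [0.0] * len(losses[0])
    total = 0.0
    for loss_t in losses:
        p = softmin_gradient(L, eta)  # p_t = grad Phi(L_{t-1})
        total += sum(pi * li for pi, li in zip(p, loss_t))
        L = [a + b for a, b in zip(L, loss_t)]
    return total

losses = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]] * 10      # 30 rounds, two experts
best = min(sum(l[i] for l in losses) for i in range(2))  # L*_T = 10
regret = gbpa_expected_loss(losses, eta=0.5) - best
```

This potential is of the FTRL form with an entropy regularizer; swapping in a different Φ̃ changes only `softmin_gradient`.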
FTRL and FTPL can be viewed as two distinct ways of smoothing the underlying non-smooth potential. In particular, FTPL uses stochastic smoothing by considering Φ̃ of the form Φ̃_D(L) = E[min_i(L_i − Z_i)], where the Z_i’s are N i.i.d. draws from the distribution D. FTRL uses smoothed potentials of the form Φ̃_F(L) = min_p (⟨p, L⟩ + F(p)) for some strictly convex F.\n\n3.3.1 From differential consistency to one-step differential stability\n\nAbernethy et al. [2] analyzed the GBPA by introducing differential consistency, defined below. The definition differs slightly from the original because it is formulated here with losses instead of gains. The notations ∇²_{ii} and ∇_i are used to refer to specific entries of the Hessian and gradient, respectively.\nDefinition 3.4 (Differential consistency). We say that a function f : R^N → R is (γ, ε)-differentially consistent if f is twice-differentiable and −∇²_{ii} f ≤ ε (∇_i f)^γ for all i ∈ [N].\nThis functions as a new measure of the potential’s smoothness. Their main idea is to decompose the regret into three penalties [2, Lemma 2.1] and bound one of them when the potential function is differentially consistent. In fact, it can be shown that the potentials in many FTPL and FTRL algorithms are differentially consistent, and this observation leads to regret bounds for such algorithms.\nQuite surprisingly, we can establish one-step stability when the algorithm is the GBPA with a differentially consistent potential function. To state the proposition, we need to introduce a technical definition. We say that a matrix is positive off-diagonal (POD) if its off-diagonal entries are non-negative and its diagonal entries are non-positive. In the FTPL case, where F(p, Z) = −⟨p, Z⟩, it was already shown by Abernethy et al. [1] that −∇²Φ̃(L) is POD. 
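The gradient of the stochastically smoothed FTPL potential, ∇_iΦ̃_D(L) = P(i = argmin_j(L_j − Z_j)), is typically estimated by sampling. A small sketch (ours) with standard Gumbel perturbations, one of the distributions covered by Theorem 3.6; for Gumbel noise the smoothed argmin has a closed form (the Gumbel-max trick), which makes the estimate easy to check:

```python
import math, random

random.seed(1)

def ftpl_probs_mc(L, n_samples=20000):
    """Monte-Carlo estimate of p_i = P(i = argmin_j (L_j - Z_j)) for
    i.i.d. standard Gumbel perturbations Z_j = -log(-log(U))."""
    counts = [0] * len(L)
    for _ in range(n_samples):
        # L_j - Z_j = L_j + log(-log(U)) for U uniform on (0, 1)
        perturbed = [l + math.log(-math.log(random.random())) for l in L]
        counts[perturbed.index(min(perturbed))] += 1
    return [c / n_samples for c in counts]

L = [0.0, 1.0, 2.0]
s = sum(math.exp(-l) for l in L)
exact = [math.exp(-l) / s for l in L]  # softmax of -L, by the Gumbel-max trick
estimate = ftpl_probs_mc(L)
assert all(abs(a - b) < 0.02 for a, b in zip(exact, estimate))
```

For other perturbations in Theorem 3.6 there is no such closed form, which is exactly why sampling-based estimates (and, in the bandit setting, Geometric Resampling) are needed.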
It is easy to show that if F(p) = ∑_i f(p_i) for a strictly convex and smooth f, then −∇²Φ̃(L) is always POD (see Appendix B). The next proposition connects differential consistency to one-step differential stability.\nProposition 3.5 (Differential consistency implies one-step differential stability). Suppose Φ̃(L) is of the form E[min_p(⟨L, p⟩ + F(p, Z))] and γ ≥ 1. If Φ̃ is (γ, ε)-differentially consistent and −∇²Φ̃ is always POD, then the GBPA using Φ̃ as its potential is DiffStable(D_{∞,γ}, ‖·‖_∞) at level 2ε.\n\n3.3.2 Optimal family of FTPL algorithms\n\nWe leverage our result from the previous section to prove that FTPL algorithms with a variety of perturbations have the minimax optimal first-order regret bound in the experts problem.\nTheorem 3.6 (First-order bound for experts via FTPL). In the loss-only experts setting, FTPL with Gamma, Gumbel, Fréchet, Weibull, and Pareto perturbations, with a proper choice of distribution parameters, all achieve the optimal O(√(L*_T log N) + log N) expected regret.\nAlthough the result above is not the first optimal first-order bound for the experts problem, such a bound for FTPL with the wide variety of distributions mentioned above is not found in the literature. Previous FTPL analyses achieving first-order regret bounds all relied on specific choices such as exponential [22] and dropout [41] perturbations. There are results that consider Gaussian perturbations [1], random-walk perturbations [14], and a large family of symmetric distributions [34], but they only provide zero-order bounds.\n\n4 Partial information setting\n\nIn this section, we provide stability-based analyses of extensions of the GBPA framework to N-armed bandits and K-armed bandits with N experts. 
All omitted proofs can be found in Appendix C.\n\n4.1 GBPA for multi-armed bandits\n\nIn the multi-armed bandit setting, only the loss ℓ_{t,i_t} of the algorithm’s chosen action is revealed. The GBPA for this setting is almost the same as Algorithm 2, but with an extra loss-estimation step. The algorithm uses importance weighting to produce estimates ℓ̂_t = (ℓ_{t,i_t}/p_{t,i_t}) e_{i_t} of the actual loss vectors (these are unbiased as long as p_t has full support) and feeds these estimates to the standard GBPA using a (smooth) potential Φ̃. Algorithm 4 in Appendix C summarizes these steps.\nThe losses fed to the full information GBPA are scaled by 1/p_{t,i_t}, which can be very large. On the other hand, there is a special structure in ℓ̂_t: it has at most one non-zero entry. The following lemma is a replacement for the key lemma (Lemma 3.1) that exploits this special structure. The first term in the bound is the analogue of the 2εL*_T term in the key lemma. This term shows why using D_{∞,γ} instead of D_∞ to measure stability can be useful in the bandit setting: the larger γ is, the less trouble we have from the inverse probability weighting inherent in ℓ̂_{t,i_t}. The second term in the bound is the analogue of the loss of the fictitious algorithm, which now depends on the (expected) range of values attained by the (possibly random) function F(p, Z).\nLemma 4.1. Suppose the full information GBPA uses a potential of the form Φ̃(L) = E[min_p(⟨L, p⟩ + F(p, Z))] and γ ∈ [1, 2]. 
If the full information GBPA is DiffStable(D_{∞,γ}, ‖·‖_∞) at level ε, then the expected regret of Algorithm 4 (in Appendix C) can be bounded as

    E[Σ_{t=1}^T ℓ_{t,i_t}] − L*_T ≤ ε E[Σ_{t=1}^T ℓ̂²_{t,i_t} p^γ_{t,i_t}] + E[max_p F(p, Z) − min_p F(p, Z)].

We will now use this lemma to analyze a variety of FTPL and FTRL algorithms. Recall that an algorithm is in the FTPL family when F(p, Z) = −⟨p, Z⟩, and is in the FTRL family when F(p, Z) = F(p) for some deterministic regularization function F(·). There is a slight complication in the FTPL case: for a given L, computing the probability p_{t,i} = ∇_i Φ̃_D(L) = P(i = argmin_{i'} (L_{i'} − Z_{i'})) is intractable, even though we can easily draw samples from this probability distribution. A method called Geometric Resampling [27] solves this problem by computing a Monte-Carlo estimate of 1/p_{t,i} (which is all that is needed to run Algorithm 4).
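As a rough illustration (a minimal sketch of the idea, not the exact procedure of [27]; the function name and truncation handling are ours), Geometric Resampling exploits the fact that the number of i.i.d. draws from p until index i first appears is Geometric(p_{t,i}), whose expectation is exactly 1/p_{t,i}:

```python
import random

def geometric_resampling(p, i, M, rng=random):
    """Monte-Carlo estimate of 1/p[i] via Geometric Resampling.

    Repeatedly sample an arm index from the distribution p; the number
    of draws until arm i first appears is Geometric(p[i]), with
    expectation exactly 1/p[i].  Truncating at M draws keeps the
    per-round cost bounded at the price of a small bias.
    """
    arms = range(len(p))
    for k in range(1, M + 1):
        if rng.choices(arms, weights=p)[0] == i:
            return k
    return M  # truncation: return M if arm i never appeared
```

Before truncation the estimate is unbiased; truncation at M shrinks it slightly, which is where the extra error controlled by M comes from.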
Neu and Bartók [27] show that the extra error due to this estimation is at most √(KT)/M, where M is the maximal number of samples per round that we use for the Monte-Carlo simulation. This implies that having M = Θ(√T) for the zero-order bound, or M = Θ(T) for the first-order bound, would not affect the order of our bounds. Furthermore, they also prove that the expected number of samples needed to run Geometric Resampling is constant per round (see [27, Theorem 2]). For simplicity, we will ignore this estimation step and assume the exact value of p_{t,i_t} is available to the learner.

Algorithm 3 GBPA for bandits with experts problem
1: Input: concave potential Φ̃ : R^N → R with ∇Φ̃ ∈ Δ^{N−1}, clipping threshold 0 ≤ ρ < 1/K
2: Set φ_0 = 0 ∈ R^N
3: for t = 1 to T do
4:   Probabilities over experts via gradient: p_t = ∇Φ̃(φ_{t−1}) ∈ Δ^{N−1}
5:   Convert probabilities from experts to actions: q_t = ψ_t(p_t) = Σ_{j=1}^K Σ_{i: E_{i,t}=j} p_{t,i} e_j ∈ Δ^{K−1}
6:   Clipping (optional): q̃_t = C_ρ(q_t), where C_ρ is defined in (4)
7:   Sampling: choose j_t ∈ [K] according to distribution q_t (or q̃_t if clipping)
8:   Loss: incur loss ℓ_{t,j_t} and observe this value
9:   Estimation: compute an estimate of the loss vector, ℓ̂_t = (ℓ_{t,j_t}/q_{t,j_t}) e_{j_t} ∈ R^K (or (ℓ_{t,j_t}/q̃_{t,j_t}) e_{j_t} if clipping)
10:  Convert estimate from actions to experts: φ_t(ℓ̂_t) = Σ_{j=1}^K Σ_{i: E_{i,t}=j} ℓ̂_{t,j} e_i ∈ R^N
11:  Update: φ_t = φ_{t−1} + φ_t(ℓ̂_t)
12: end for

Theorem 4.2 (Zero-order and first-order regret bounds for multi-armed bandits). Algorithm 4 (in Appendix C) enjoys the following bounds when used with different perturbations/regularizers:
1. FTPL with Gamma, Gumbel, Fréchet, Weibull, and Pareto perturbations (with a proper choice of distribution parameters) all achieve near-optimal expected regret of O(√(NT log N)).
2. FTRL with Tsallis negentropy F(p) = −η Σ_{i=1}^N p_i log_α(1/p_i) for 0 < α < 1 (with a proper choice of η) achieves optimal expected regret of O(√(NT)).
3. FTRL with log-barrier regularizer F(p) = −η Σ_{i=1}^N log p_i (with a proper choice of η) achieves expected regret of O(√(N L*_T log(NT)) + N log(NT)).

The proofs of the above results use one-step differential stability as the unifying theme: in Part 1 we establish stability w.r.t. D_∞; in Part 2, w.r.t. D_{∞,2−α}; and in Part 3, w.r.t. D_{∞,2}. Parts 1-2 essentially rederive the results of Abernethy et al. [2] in the differential stability framework. Part 3 is quite interesting since it uses the strongest stability notion in this paper (w.r.t. D_{∞,2}). First-order regret bounds for multi-armed bandits have been obtained via specialized analyses several times [6, 26, 33, 38]. Such is the obscure nature of these analyses that in one case the authors claimed novelty without realizing that earlier first-order bounds existed! The intuition behind such analyses has remained somewhat unclear. Our analysis of the log-barrier regularizer clearly indicates why it enjoys first-order bounds (ignoring the log(NT) term): the resulting full information algorithm enjoys a particularly strong form of one-step differential stability.

4.2 GBPA for bandits with experts

We believe that our unified differential stability based analysis of adversarial bandits can be extended to more complex partial information settings. We provide evidence for this by considering the problem of adversarial bandits with experts.
In this more general problem, which was introduced in the same seminal work that introduced the adversarial bandits problem [9], there are K actions and N experts, E_1, ..., E_N, that at each round t give advice on which of the K actions to take. The algorithm is supposed to combine their advice to pick a distribution q_t ∈ Δ^{K−1} and chooses an action j_t ∼ q_t. Denote the suggestion of the ith expert at time t as E_{i,t} ∈ [K]. Expected regret in this problem is defined as E[Σ_{t=1}^T ℓ_{t,j_t}] − L*_T, where L*_T is now defined as L*_T = min_{i=1,...,N} Σ_{t=1}^T ℓ_{t,E_{i,t}}.

The GBPA for this setting has a few more ingredients compared to the one for the multi-armed bandits. First, a transformation ψ_t converts p_t ∈ Δ^{N−1}, a distribution over experts, into q_t ∈ Δ^{K−1}, a distribution over actions: ψ_t(p_t) = Σ_{j=1}^K Σ_{i: E_{i,t}=j} p_{t,i} e_j, where e_j is the jth basis vector in R^K. Note that the probability assigned to each action is the sum of the probabilities of all the experts that recommended that action. Second, a transformation φ_t converts the loss estimate ℓ̂_t ∈ R^K, defined by ℓ̂_{t,j_t} = ℓ_{t,j_t}/q_{t,j_t} (and zero for j ≠ j_t), into a loss estimate in R^N: φ_t(ℓ̂_t) = Σ_{j=1}^K Σ_{i: E_{i,t}=j} ℓ̂_{t,j} e_i, where e_i is the ith basis vector in R^N.
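To make the two transformations concrete, here is a minimal numpy sketch (the names psi_t/phi_t and the 0-indexed advice encoding are ours). By construction the maps are adjoint, i.e. ⟨ψ_t(p), ℓ̂⟩ = ⟨p, φ_t(ℓ̂)⟩:

```python
import numpy as np

def psi_t(p, advice, K):
    """Experts -> actions: q_j = sum of p_i over experts i with E_{i,t} = j."""
    q = np.zeros(K)
    for i, j in enumerate(advice):   # advice[i] = E_{i,t} in {0, ..., K-1}
        q[j] += p[i]
    return q

def phi_t(loss_est, advice):
    """Actions -> experts: expert i inherits the estimated loss of its advice."""
    return np.asarray([loss_est[j] for j in advice], dtype=float)
```

The adjointness follows because both sides equal Σ_i p_i · ℓ̂_{E_{i,t}}, which is what makes the loss estimates fed to the full information algorithm consistent with the action distribution actually played.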
At time t, the full information algorithm's output p_t is used to select the action distribution q_t = ψ_t(p_t), and the full information algorithm is fed φ_t(ℓ̂_t) to update p_t. Note that ψ_t and φ_t are defined such that ⟨ψ_t(p), ℓ̂⟩ = ⟨p, φ_t(ℓ̂)⟩ for any p ∈ Δ^{N−1} and any ℓ̂ ∈ R^K_+. Lastly, the clipping function C_ρ : Δ^{K−1} → Δ^{K−1} is defined as:

    [C_ρ(q)]_j = q_j / (1 − Σ_{j': q_{j'} < ρ} q_{j'})   if q_j ≥ ρ,
    [C_ρ(q)]_j = 0                                        if q_j < ρ.    (4)

It sets the probability weights that are less than ρ to 0 and rescales the rest to make the result a distribution. The clipping step (step 6) is optional. In fact, we can prove the zero-order bound without clipping, but this step becomes crucial to show the first-order bound. The main intuition is to bound the size of the loss estimate ℓ̂_t. With clipping, we can ensure ‖ℓ̂_t‖_∞ ≤ 1/ρ for all t, which provides better control on the one-step stability. The regret bounds in the bandits with experts setting appear below.

Theorem 4.3 (Zero-order and first-order regret bounds for bandits with experts). Algorithm 3 enjoys the following bounds when used with different perturbations such as Gamma, Gumbel, Fréchet, Weibull, and Pareto (with a proper choice of parameters).
1. With no clipping, it achieves near-optimal expected regret of O(√(KT log N)).
2. With clipping, it achieves expected regret of O((K log N)^{1/3} (L*_T)^{2/3}).

The zero-order bound in Part 1 above was already shown for the celebrated EXP4 algorithm by Auer et al. [9]. Furthermore, Agarwal et al.
[3] proved that, with clipping, EXP4 also enjoys a first-order bound with O((L*_T)^{2/3}) dependence. Our theorem shows that EXP4 is not special in enjoying these bounds: the same bounds continue to hold for a variety of perturbation based algorithms. Such a result does not appear in the literature to the best of our knowledge. We note here that achieving O(√(L*_T)) bounds in this setting was posed as an open problem by Agarwal et al. [3]. This problem was recently solved by the algorithm MYGA [5]. MYGA does achieve the optimal first-order bound, but the algorithm is not simple in that it has to maintain Θ(T) auxiliary experts in every round. In contrast, our algorithms are simple, as they are all instances of GBPA along with the clipping idea.

Acknowledgments

Part of this work was done while AM was visiting the Simons Institute for the Theory of Computing. AM was supported by NSF grant CCF-1763786, a Sloan Foundation Research Award, and a postdoctoral fellowship from BU's Hariri Institute for Computing. AT and YJ were supported by NSF CAREER grant IIS-1452099. AT was also supported by a Sloan Research Fellowship. JA was supported by NSF CAREER grant IIS-1453304.

References

[1] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In Conference on Learning Theory, pages 807–823, 2014.

[2] Jacob D. Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems, pages 2197–2205, 2015.

[3] Alekh Agarwal, Akshay Krishnamurthy, John Langford, and Haipeng Luo. Open problem: First-order regret bounds for contextual bandits. In Conference on Learning Theory, pages 4–7, 2017.

[4] Naman Agarwal and Karan Singh. The price of differential privacy for online learning. In International Conference on Machine Learning, 2017.

[5] Zeyuan Allen-Zhu, Sebastien Bubeck, and Yuanzhi Li.
Make the minority great again: First-order regret bound for contextual bandits. In International Conference on Machine Learning, pages 186–194, 2018.

[6] Chamy Allenberg, Peter Auer, László Györfi, and György Ottucsák. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In International Conference on Algorithmic Learning Theory, pages 229–243. Springer, 2006.

[7] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 852–860. ACM, 2019.

[8] Shun-ichi Amari. Information geometry and its applications. Springer, 2016.

[9] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[10] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In STOC, 2016.

[11] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[12] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[13] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning with robust generalization guarantees. In Conference on Learning Theory, pages 23–26, 2016.

[14] Luc Devroye, Gábor Lugosi, and Gergely Neu. Prediction by random-walk perturbation. In Conference on Learning Theory, pages 460–473, 2013.

[15] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.

[16] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan.
Boosting and differential privacy. In IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60, 2010.

[17] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In STOC, 2015.

[18] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[19] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.

[20] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[21] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory, pages 24.1–24.34, 2012.

[22] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

[23] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25.1–25.40, 2012.

[24] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[25] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS, 2007.

[26] Gergely Neu. First-order regret bounds for combinatorial semi-bandits. In Conference on Learning Theory, pages 1360–1375, 2015.

[27] Gergely Neu and Gábor Bartók. An efficient algorithm for learning with semi-bandit feedback. In International Conference on Algorithmic Learning Theory, pages 234–248, 2013.

[28] Kobbi Nissim and Uri Stemmer.
On the generalization properties of differential privacy. arXiv preprint arXiv:1504.05800, 2015.

[29] Francesco Orabona, Nicolo Cesa-Bianchi, and Claudio Gentile. Beyond logarithmic bounds in online learning. In Artificial Intelligence and Statistics, pages 823–831, 2012.

[30] Jean-Paul Penot. Sub-hessians, super-hessians and conjugation. Nonlinear Analysis: Theory, Methods & Applications, 23(6):689–702, 1994.

[31] Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee, and Partha Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419, 2004.

[32] Tomaso Poggio, Stephen Voinea, and Lorenzo Rosasco. Online learning, stability, and stochastic gradient descent. arXiv preprint arXiv:1105.4701, 2011.

[33] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Conference on Learning Theory, 2013.

[34] Sasha Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems, pages 2141–2149, 2012.

[35] Stéphane Ross and J. Andrew Bagnell. Stability conditions for online learnability. arXiv preprint arXiv:1108.3154, 2011.

[36] Ankan Saha, Prateek Jain, and Ambuj Tewari. The interplay between stability and regret in online learning. arXiv preprint arXiv:1211.6158, 2012.

[37] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199–2207, 2010.

[38] Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. PhD thesis, Université Paris Sud-Paris XI, 2005.

[39] Abhradeep Guha Thakurta and Adam Smith. (Nearly) optimal algorithms for private online learning in full-information and bandit settings.
In Advances in Neural Information Processing Systems, pages 2733–2741, 2013.

[40] Aristide Charles Yedia Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In AAAI, 2017.

[41] Tim Van Erven, Wojciech Kotłowski, and Manfred K. Warmuth. Follow the leader with dropout perturbations. In Conference on Learning Theory, pages 949–974, 2014.

[42] Bin Yu. Stability. Bernoulli, 19(4):1484–1500, 2013.