{"title": "User-Specified Local Differential Privacy in Unconstrained Adaptive Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 14103, "page_last": 14112, "abstract": "Local differential privacy is a strong notion of privacy in which the provider of the data guarantees privacy by perturbing the data with random noise. In the standard application of local differential differential privacy the distribution of the noise is constant and known by the learner. In this paper we generalize this approach by allowing the provider of the data to choose the distribution of the noise without disclosing any parameters of the distribution to the learner, under the constraint that the distribution is symmetrical. We consider this problem in the unconstrained Online Convex Optimization setting with noisy feedback. In this setting the learner receives the subgradient of a loss function, perturbed by noise, and aims to achieve sublinear regret with respect to some competitor, without constraints on the norm of the competitor. We derive the first algorithms that have adaptive regret bounds in this setting, i.e. our algorithms adapt to the unknown competitor norm, unknown noise, and unknown sum of the norms of the subgradients, matching state of the art bounds in all cases.", "full_text": "User-Speci\ufb01ed Local Differential Privacy in\nUnconstrained Adaptive Online Learning\n\nDirk van der Hoeven\nMathematical Institute\n\nLeiden University\nLeiden, 2333 CA\n\ndirkvderhoeven@gmail.com\n\nAbstract\n\nLocal differential privacy is a strong notion of privacy in which the provider of\nthe data guarantees privacy by perturbing the data with random noise. In the\nstandard application of local differential privacy the distribution of the noise is\nconstant and known by the learner. 
In this paper we generalize this approach by allowing the provider of the data to choose the distribution of the noise without disclosing any parameters of the distribution to the learner, under the constraint that the distribution is symmetrical. We consider this problem in the unconstrained Online Convex Optimization setting with noisy feedback. In this setting the learner receives the subgradient of a loss function, perturbed by noise, and aims to achieve sublinear regret with respect to some competitor, without constraints on the norm of the competitor. We derive the first algorithms that have adaptive regret bounds in this setting, i.e. our algorithms adapt to the unknown competitor norm, unknown noise, and unknown sum of the norms of the subgradients, matching state-of-the-art bounds in all cases.

1 Introduction

In learning, a natural tension exists between learners and the providers of data. The learner aims to make optimal use of the data, perhaps even at the cost of the privacy of the providers. To nevertheless ensure sufficient privacy the provider can add random noise to the data that he sends to the learner. This idea is called ε-local differential privacy (Wasserman and Zhou, 2010; Duchi et al., 2014), and the standard implementation has a constant ε for all providers. However, not all providers care equally about their privacy (Song et al., 2015). Some providers may wish to aid the learner in making optimal use of their data, while other providers value their privacy over helping the learner. For instance, celebrities might care more for their privacy than others because they want to preserve the privacy they have left. To complicate things further, the providers of the data may not wish to reveal how much they care about their privacy, because when privacy levels differ between providers these privacy levels become privacy sensitive themselves.
Furthermore, not all parts of the data are equally privacy sensitive. For example, tweets are already publicly available, but browsing history may contain sensitive information that should be kept private. To capture these varying privacy constraints we allow each provider to choose how much noise is added for each dimension of the data.

In this paper, we consider these problems in the Online Convex Optimization (OCO) setting (Hazan, 2016) with local differential privacy guarantees. The OCO framework is a popular and successful framework to design and analyse many algorithms used to train machine learning models. The OCO setting proceeds in rounds t = 1, ..., T. In a given round t the learner is to provide a prediction w_t ∈ R^d. An adversary then chooses a convex loss function ℓ_t and sends a subgradient g_t ∈ ∂ℓ_t(w_t) to the learner. We work with an unconstrained domain for w, which has recently grown in popularity (see McMahan and Orabona (2014); Foster et al. (2015); Orabona and Pál (2016); Foster et al. (2017); Cutkosky and Boahen (2017); Kotłowski (2017); Cutkosky and Orabona (2018); Foster et al. (2018); Jun and Orabona (2019)). We aim to develop online learning methods that make the best use of data providers who wish to help the learner while at the same time guaranteeing the desired level of privacy for providers that care about their privacy, without knowing how much noise each provider adds to the data.

We consider the local differential privacy model with varying levels of privacy unknown to the learner. Differential privacy (Dwork and Roth, 2014) is a privacy model that is used in many recent machine-learning applications.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
The local differential privacy model is a variant of differential privacy in which the learner can only access the data of the provider via noisy estimates (Wasserman and Zhou, 2010; Duchi et al., 2014). The local differential privacy model with varying levels of privacy appeared before in Song et al. (2015), but with known levels of noise and only two levels of noise.

Learning in our setting is modelled by the OCO framework with noisy estimates of the subgradient (see also Jun and Orabona (2019)). To ensure local differential privacy the provider adds zero-mean noise ξ_t ∈ R^d to the subgradient g_t. The learner then receives the perturbed subgradient g̃_t = g_t + ξ_t. We allow each ξ_t to follow a different distribution each round to satisfy different privacy guarantees. In the standard OCO framework the goal of the learner is to minimize the regret with respect to some parameter u ∈ R^d:

$$R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(u).$$

However, since the learner receives perturbed subgradients we consider the expected regret E[R_T(u)], where the expectation is over the randomness in w_t due to the noisy subgradients. The setting will be formally introduced in section 2. Because g̃_t ∈ R^d is unbounded, standard algorithms for unconstrained domains do not work, since they require bounded g̃_t. Initial work in this setting by Jun and Orabona (2019) was motivated by a lower bound of Cutkosky and Boahen (2017), which shows that one can suffer an exponential penalty when both the domain and the subgradients are unbounded. They replace the boundedness assumption on g̃_t by a boundedness assumption on E[g̃_t] and an assumption on the tails of the noise distribution. Jun and Orabona (2019) achieved expected regret guarantees of $O(\|u\|\sqrt{(G^2 + \sigma^2)T\ln(1+\|u\|T)})$, where σ² is a uniform upper bound on E[‖ξ_t‖²_⋆], G² is a uniform upper bound on ‖g_t‖²_⋆, and ‖·‖ and ‖·‖_⋆ are dual norms. This bound is useful when the distribution of the noise is constant and known and an adversary selects g_t. We derive an algorithm that satisfies

$$\mathbb{E}[R_T(u)] = O\left(\|u\|\sqrt{\left(G^2 T + \sum_{t=1}^T \sigma_t^2\right)\ln(1 + \|u\|T)}\right), \qquad (1)$$

where $\sigma_t^2 = \mathbb{E}[\|\xi_t\|_\star^2]$. This bound can be smaller in cases where only a few σ_t are large but most are small. In fact, we will prove something stronger than (1):

$$\mathbb{E}[R_T(u)] = O\left(\mathbb{E}\left[\|u\|\sqrt{\sum_{t=1}^T \|\tilde g_t\|_\star^2\,\ln(1 + \|u\|T)}\right]\right), \qquad (2)$$

which implies (1) via Jensen's inequality and $\mathbb{E}[\|\tilde g_t\|_\star^2] \le 3\,\mathbb{E}[\|\xi_t\|_\star^2] + 3\,\mathbb{E}[\|g_t\|_\star^2]$. This bound was motivated by work in the noiseless setting, where $O(\|u\|\sqrt{\sum_{t=1}^T \|g_t\|_\star^2\,\ln(1 + \|u\|T)})$ bounds are possible (Cutkosky and Orabona, 2018). With these types of bounds, when the sum of the squared norms of the subgradients is small the regret is also small. To achieve (2) we require two assumptions: bounded ‖g_t‖_⋆ and zero-mean symmetrical noise ξ_t. The assumption on g_t is common in standard OCO. The symmetrical noise assumption is satisfied by common mechanisms that ensure local differential privacy. The dependence on E[‖ξ_t‖²_⋆] and E[‖g_t‖²_⋆] is unimprovable, which is shown by the lower bound for this setting by Jun and Orabona (2019).

The algorithms in this paper are built using the recently developed wealth-regret duality approach (McMahan and Streeter, 2012). We provide two algorithms. The first achieves the bound in (2). The second algorithm satisfies (2) for each dimension separately. This second algorithm can exploit sparse privacy structures, which combined with sparse subgradients yields low expected regret bounds.

Contributions. We extend the known results in several directions. Many common local differential privacy applications use symmetric additive noise (the Laplace mechanism, the normal mechanism). We use the symmetry of the noise to adapt to unknown levels of privacy and achieve adaptive expected regret bounds. We also adapt to dimension-specific privacy requirements, again without requiring knowledge of the structure of the noise other than symmetry in each dimension. Our algorithms interpolate between no noise and maximum noise, matching state-of-the-art bounds in both cases. This can reduce the cost of privacy in some cases, outlined in section 4. Our work partially answers two problems left open by Jun and Orabona (2019). The first question asks whether or not data-dependent bounds are possible in the noisy OCO setting, which we answer affirmatively. The second question is how to adapt to different levels of noise without using extra parameters compared to the noiseless setting, which we do for symmetric noise.

Related work. There has been significant work on unconstrained and adaptive methods in OCO with noiseless subgradients g_t; see Foster et al.
(2015); Orabona and Pál (2016); Foster et al. (2017); Cutkosky and Boahen (2017); Kotłowski (2017); Cutkosky and Orabona (2018); Foster et al. (2018). However, these results do not extend to the setting with noisy unbounded subgradients g̃_t, which our work makes possible. For bounded domains, regret bounds of $O(D\sqrt{\sum_{t=1}^T \|\tilde g_t\|_\star^2})$ are possible without knowledge of the noise (Duchi et al., 2011; Orabona and Pál, 2018), where D is an upper bound on ‖u‖. However, these bounds do not adapt to unknown ‖u‖, which may be costly for large D but small ‖u‖. We provide an algorithm that both scales with ‖u‖ instead of D and does not require knowledge of the noise.

There is a body of literature in the differential privacy setting with online feedback (Jain et al., 2012; Jain and Thakurta, 2014; Thakurta and Smith, 2013; Agarwal and Singh, 2017; Abernethy et al., 2017). In this paper we consider local differential privacy (Wasserman and Zhou, 2010; Duchi et al., 2014), which is a stronger notion of privacy than differential privacy. Duchi et al. (2014) provide an algorithm with constant local differential privacy that learns by using SGD. Song et al. (2015) derive how to use knowledge of several levels of local differential privacy for SGD, but only with two different levels of noise. Jun and Orabona (2019) consider local privacy with an unbounded domain and constant noise. With knowledge of the noise it is possible to extend the results of Jun and Orabona (2019) to achieve (1), but not (2).

Outline. In section 2 we formally introduce our problem and the key techniques. In section 3 we derive a one-dimensional algorithm that achieves our goals, which we use in a black-box reduction in section 3.1 and apply coordinate-wise in section 3.2.
Section 4 contains two scenarios in which our new algorithm achieves improvements compared to current algorithms. Finally, in section 5 we present our conclusions.

2 Problem Formulation and Preliminaries

In this section we describe our notation, introduce the version of local differential privacy we use, briefly introduce the OCO setting with noisy subgradients, and provide some background to the reward-regret duality paradigm.

Notation. A random variable x is called symmetric if the density function ρ of the random variable z = x − E[x] satisfies ρ(z) = ρ(−z). The inner product between vectors g ∈ R^d and w ∈ R^d is denoted by ⟨w, g⟩. The Fenchel conjugate F^⋆ of a convex function F is defined as F^⋆(w) = sup_g ⟨w, g⟩ − F(g). ‖·‖ denotes a norm and ‖g‖_⋆ = sup_{w:‖w‖≤1} ⟨w, g⟩ denotes the dual norm. g_{t,j} indicates the jth component of vector g_t.

2.1 User-Specified Local Differential Privacy

In the local differential privacy setting each datum is kept private from the learner. The standard definition of local privacy requires a randomiser R that perturbs g_t with random noise ξ_t, where ξ_1, ..., ξ_T are independently distributed (Wasserman and Zhou, 2010; Kasiviswanathan et al., 2011; Duchi et al., 2014). The amount of perturbation is controlled by ε, where smaller ε means more privacy. We allow the provider to specify his desired level of privacy, so in a given round t we have ε_t-local differential privacy.

Definition 1 (Duchi et al., 2014). Let A = (X_1, ..., X_T) be a sensitive dataset where each X_t ∈ A corresponds to data about individual t. A randomiser R which outputs a disguised version S = (U_1, . . .
, U_T) of A is said to provide ε-local differential privacy to individual t if, for all x, x′ ∈ A and for all S ⊆ S,

$$\Pr(U_t \in S \mid X_t = x) \le \exp(\epsilon)\,\Pr(U_t \in S \mid X_t = x').$$

In this paper we make use of randomisers of the form R_t(g_t) = g_t + ξ_t, where ξ_t is generated by a zero-mean symmetrical distribution ρ_t. A common choice for ρ_t is ρ_t(z) ∝ exp(−(ε_t/2)‖z‖) (Song et al., 2015). This randomiser is ε_t-local differentially private for ‖g_t‖ ≤ 1 (Song et al., 2015, Theorem 1). We make use of a small variation of this randomiser, which we call the local Laplace randomiser:

$$\rho_t(z) \propto \exp\left(-\sum_{j=1}^d \frac{\tau_{t,j}}{2}|z_j|\right), \quad \text{where } \sum_{j=1}^d \tau_{t,j} = \epsilon_t \text{ and } \tau_{t,j} \ge 0.$$

The following result shows that the local Laplace randomiser preserves ε_t-local differential privacy.

Lemma 1. Suppose |g_{t,j}| ≤ 1 for all j. Then the local Laplace randomiser is ε_t-local differentially private, where $\epsilon_t = \sum_{j=1}^d \tau_{t,j}$.

The proof follows from applying Theorem 1 of Song et al. (2015) to each dimension and summing the τ_{t,j}. For completeness the proof is provided in Appendix A. This randomiser is the Laplace randomiser (Dwork and Roth, 2014) applied to each dimension with a possibly different ε per dimension. The local Laplace randomiser gives the user more control over the details of the privacy guarantees: with the local Laplace randomiser each dimension j is τ_{t,j}-local differentially private. This can also lead to lower regret in some cases, of which we give an example in section 4.

2.2 Online Convex Optimization with Noisy Subgradients

The analysis of many efficient online learning tools has been influenced by the Online Convex Optimization framework. As mentioned in the introduction, the OCO setting with noisy subgradients proceeds in rounds t = 1, ..., T.
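The provider-side perturbation used in this protocol, the local Laplace randomiser of section 2.1, can be sketched in a few lines of Python. This is our illustration rather than code from the paper: dimension j receives Laplace noise with scale 2/τ_{t,j} (density ∝ exp(−(τ_{t,j}/2)|z_j|)), and the per-dimension budgets τ_{t,j} sum to the round's privacy level ε_t.

```python
import math
import random

def laplace_sample(scale, rng=random):
    # Zero-mean Laplace(scale) noise via inverse-CDF sampling.
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def local_laplace_randomiser(g, tau):
    """Perturb a subgradient g coordinate-wise.

    Coordinate j gets noise with density proportional to
    exp(-(tau[j] / 2) * |z_j|), i.e. Laplace noise with scale 2 / tau[j];
    the overall privacy level is eps_t = sum(tau).
    """
    assert len(g) == len(tau) and all(t > 0 for t in tau)
    return [gj + laplace_sample(2.0 / tj) for gj, tj in zip(g, tau)]

# Example: spend most of the privacy budget eps_t = 1.0 on coordinate 0,
# so coordinate 1 (the more sensitive one) receives much larger noise.
random.seed(0)
g_tilde = local_laplace_randomiser([0.3, -0.7], tau=[0.9, 0.1])
```

Note that a τ_{t,j} closer to 0 means a larger noise scale and hence more privacy for that coordinate, matching Lemma 1: coordinate j is τ_{t,j}-local differentially private.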
In each round t:

1. The learner sends w_t ∈ R^d to the provider of the tth subgradient.
2. The provider samples ξ_t from a zero-mean and symmetrical ρ_t and computes subgradient g_t ∈ ∂ℓ_t(w_t), where ‖g_t‖_⋆ ≤ G.
3. The provider sends g̃_t = g_t + ξ_t ∈ R^d to the learner.

This protocol is a slight adaptation of the protocol of Duchi et al. (2014), where we allow a different ρ_t in each round t instead of using a constant ρ. In each round the provider only sends g̃_t to the learner. The learner has no information about ρ_t other than that ρ_t is symmetrical and zero-mean. Also note that ρ_t is allowed to change with each round t, complicating things even further. Since the feedback the learner receives is random we are interested in the expected regret. To bound the expected regret we upper bound the losses by their tangents:

$$\mathbb{E}[R_T(u)] \le \mathbb{E}\left[\sum_{t=1}^T \langle w_t - u, g_t\rangle\right] = \mathbb{E}\left[\sum_{t=1}^T \langle w_t - u, \tilde g_t\rangle\right], \qquad (3)$$

where the equality holds because of the law of total expectation. The analysis focuses on bounding the r.h.s. of (3), which is a standard approach in OCO. In the following we introduce a recently popularized method to control the regret when w_t and u are unbounded.

2.3 Reward-Regret Duality

The reward-regret duality paradigm converts a lower bound on the cumulative reward $-\sum_{t=1}^T \langle w_t, g_t\rangle$ into an upper bound on the regret. For noisy g̃_t, the formal result is found in the following lemma (see also Theorem 3 of Jun and Orabona (2019)).

Lemma 2. If $-\mathbb{E}[\sum_{t=1}^T \langle w_t, g_t\rangle] \ge \mathbb{E}[F_T(-\sum_{t=1}^T \tilde g_t) - c_T]$ for some convex function F_T and c_T ∈ R, then $\mathbb{E}[R_T(u)] \le \mathbb{E}[c_T] + F_T^\star(u)$.

Proof. From the definition of Fenchel conjugates we have $\mathbb{E}[F_T(-\sum_{t=1}^T \tilde g_t)] \ge \mathbb{E}[-F_T^\star(u) - \sum_{t=1}^T \langle u, \tilde g_t\rangle] = -F_T^\star(u) - \sum_{t=1}^T \langle u, g_t\rangle$. Using $-\mathbb{E}[\sum_{t=1}^T \langle w_t, g_t\rangle] \ge \mathbb{E}[F_T(-\sum_{t=1}^T \tilde g_t) - c_T]$ and reordering the terms completes the proof.

The difficulty lies in finding a suitable F_T and c_T. For example, we could use gradient descent with learning rate η, which gives $F_T(-\sum_{t=1}^T \tilde g_t) = \frac{\eta}{2}\|\sum_{t=1}^T \tilde g_t\|_2^2$ and $c_T = \sum_{t=1}^T \frac{\eta}{2}\|\tilde g_t\|_2^2$. However, it would be impossible to tune η optimally due to the dependence on the unknown u in $F_T^\star(u) = \frac{1}{2\eta}\|u\|_2^2$. For noiseless subgradients g_t, Cutkosky and Orabona (2018) provide a route to find a suitable F_T with a constant c_T. Jun and Orabona (2019) extend this idea to noisy subgradients g̃_t: one needs to find F_t, F_{t−1}, and w_t that satisfy $F_{t-1}(x) - \langle w_t, g_t\rangle \ge \mathbb{E}_{\tilde g_t}[F_t(x - \tilde g_t)]$. Assuming that $-\mathbb{E}[\sum_{s=1}^t \langle w_s, g_s\rangle] \ge \mathbb{E}[F_t(-\sum_{s=1}^t \tilde g_s)]$ holds, one can then show by induction that $-\mathbb{E}[\sum_{t=1}^T \langle w_t, g_t\rangle] \ge \mathbb{E}[F_T(-\sum_{t=1}^T \tilde g_t)]$ holds. The result is given in the following lemma, of which the proof can be found in Appendix A.

Lemma 3. Suppose that $F_{t-1}(x) - \langle w_t, g_t\rangle \ge \mathbb{E}_{\tilde g_t}[F_t(x - \tilde g_t)]$ holds for all t. Then

$$-\mathbb{E}\left[\sum_{t=1}^T \langle w_t, g_t\rangle\right] \ge \mathbb{E}\left[F_T\left(-\sum_{t=1}^T \tilde g_t\right)\right].$$

3 One-Dimensional Private Adaptive Potential Function

Algorithm 1 Local Differentially Private Adaptive Potential Function
Require: G such that |E[g̃_t]| ≤ G and prior P on v ∈ [−1/(5G), 1/(5G)]
1: for t = 1, ..., T do
2:   Play $w_t = \mathbb{E}_{v\sim P}[v\exp(-\sum_{s=1}^{t-1} v\tilde g_s - (v\tilde g_s)^2)]$
3:   Receive symmetric g̃_t ∈ R
4: end for

In this section we derive a suitable potential function for a one-dimensional problem. In the remainder of this paper we use this one-dimensional potential to derive new algorithms. To derive our one-dimensional potential function we rely on a property of symmetric random variables with bounded means. The following lemma is key to deriving our potential function F_T.

Lemma 4. Suppose x is a symmetrical random variable with |E[⟨v, x⟩]| ≤ 1/5 for some v. Then

$$\mathbb{E}[\exp(\langle v, x\rangle - \langle v, x\rangle^2)] \le 1 + \mathbb{E}[\langle v, x\rangle].$$

The proof of Lemma 4 can be found in Appendix B. We can now use Lemma 4 to derive a one-dimensional potential function. Suppose g̃_t ∈ R is a symmetrical random variable with |E[g̃_t]| ≤ G. Then v g̃_t with v ∈ [−1/(5G), 1/(5G)] satisfies the assumptions in Lemma 4. Multiplying the lower bound of Lemma 4 for 1 − E[v g̃_t], for t = 1, ..., T, yields a potential function via Lemma 3. The potential we find is

$$\mathbb{E}\left[F_t\left(-\sum_{s=1}^t \tilde g_s\right)\right] = \mathbb{E}\left[\mathbb{E}_{v\sim P}\left[\exp\left(-\sum_{s=1}^t v\tilde g_s - (v\tilde g_s)^2\right) - 1\right]\right], \qquad (4)$$

where P is an (improper) prior on v ∈ [−1/(5G), 1/(5G)], the first expectation is over g̃_1, ..., g̃_t, and F_0(0) = 0. This kind of potential function has been used before by Chernov and Vovk (2010); Koolen and Van Erven (2015); Jun and Orabona (2019). The novelty in this particular potential function is the incorporation of the symmetrical noise. The $\sum_{s=1}^t (v\tilde g_s)^2$ term is unique to our potential function and allows us to derive adaptive regret bounds for unconstrained u. Note that the c_T = 1 term has moved inside the definition of F_T. While this does not influence the analysis for proper priors it does influence the analysis for improper priors. The corresponding prediction strategy is given by

$$w_t = \mathbb{E}_{v\sim P}\left[v\exp\left(-\sum_{s=1}^{t-1} v\tilde g_s - (v\tilde g_s)^2\right)\right]. \qquad (5)$$

Algorithm 1 summarizes the strategy. Note that Algorithm 1 does not require any extra parameters compared to the setting with noiseless subgradients. The following result shows that F_T defined by (4) and w_t defined by (5) satisfy our assumptions.

Lemma 5. Suppose g̃_t is a symmetrical random variable with |E[g̃_t]| ≤ G. Then F_t defined by (4) and w_t defined by (5) satisfy $\mathbb{E}_{\tilde g_t}[F_t(-\sum_{s=1}^t \tilde g_s)] \le F_{t-1}(-\sum_{s=1}^{t-1} \tilde g_s) - w_t\,\mathbb{E}[\tilde g_t]$.

The proof follows from an application of Lemma 4 and can be found in Appendix B. We consider two types of priors. The first type are proper priors of the form

$$\frac{dP(v)}{dv} = \frac{\nu(v)\exp(-bv^2)}{Z}, \qquad (6)$$

where b ≥ 0, ν : [−1/(5G), 1/(5G)] → R₊, and $Z = \int_{-1/(5G)}^{1/(5G)} \nu(v)e^{-bv^2}\,dv$ is a normalizing constant. This captures several priors used in the literature, including the conjugate prior $\frac{dP}{dv} = \frac{\exp(-bv^2)}{Z}$ (Koolen and Van Erven, 2015), a variant of the CV prior $\frac{dP}{dv} = \frac{1}{Z|v|\ln(|v|)^2}$ (for G > 1/5) (Chernov and Vovk, 2010; Koolen and Van Erven, 2015), and the uniform prior on [−1/(5G), 1/(5G)] (Jun and Orabona, 2019). The second type of prior is an improper prior: $\frac{dP}{dv} = \frac{1}{|v|}$. A variant of this prior was previously used by Koolen and Van Erven (2015). For all priors we derive a regret bound by computing an upper bound on the convex conjugate $F_T^\star$ of F_T. For conciseness we only present the regret bound for the conjugate prior in the main text.
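For intuition, prediction (5) can be approximated by discretizing the prior. The sketch below is our simplification (a uniform grid prior on [−1/(5G), 1/(5G)] rather than the conjugate prior); it shows the betting behaviour of Algorithm 1, where a run of negative noisy subgradients pushes the prediction positive.

```python
import math

def potential_prediction(g_history, G, n_grid=2001):
    """Approximate (5): w_t = E_{v~P}[v * exp(-sum_s (v*g_s + (v*g_s)**2))]
    with P a uniform prior on a grid over [-1/(5G), 1/(5G)]."""
    lim = 1.0 / (5.0 * G)
    total = 0.0
    for i in range(n_grid):
        v = -lim + 2.0 * lim * i / (n_grid - 1)
        expo = -sum(v * g + (v * g) ** 2 for g in g_history)
        total += v * math.exp(expo)
    return total / n_grid

# No data: the symmetric prior gives w_1 = 0; after 50 negative noisy
# subgradients the algorithm bets on a positive competitor.
w1 = potential_prediction([], G=1.0)
w51 = potential_prediction([-1.0] * 50, G=1.0)
```

Note that, as in Algorithm 1, no parameter beyond G is needed: the noise distribution never enters the computation.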
In Appendix C we present the analysis of the regret of the improper prior, for which a slightly different analysis is required compared to the proper priors. The analysis for all priors can be seen as performing a Laplace approximation of the integral over v to show that the prior places sufficient mass in a neighbourhood of the optimal v.

Abbreviating $B_t = b + \sum_{s=1}^{t-1}\tilde g_s^2$, $L_t = -\sum_{s=1}^{t-1}\tilde g_s$, and $C = \frac{1}{5G}$, the predictions (5) with the conjugate prior are given by

$$w_t = \frac{\sqrt{b}\,L_t\exp\!\left(\frac{(L_t+2CB_t)^2}{4B_t}\right)\left(\operatorname{erf}\!\left(\frac{L_t+2CB_t}{2\sqrt{B_t}}\right)-\operatorname{erf}\!\left(\frac{L_t-2CB_t}{2\sqrt{B_t}}\right)\right)+\frac{2\sqrt{bB_t}}{\sqrt{\pi}}\left(1-\exp(2CL_t)\right)}{\operatorname{erf}(C\sqrt{b})\,\exp\!\left(C(L_t+CB_t)\right)4B_t^{3/2}}. \qquad (7)$$

These w_t can be computed efficiently; see Koolen and Van Erven (2015) for a numerically stable evaluation. With the conjugate prior we find the following result.

Theorem 1. Suppose g̃_t is a symmetrical random variable with |E[g̃_t]| ≤ G for all t. Then the predictions (7) satisfy

$$\mathbb{E}[R_T(u)] = O\left(|u|G\ln(|u|G+1) + \mathbb{E}\left[|u|\sqrt{\left(b+\sum_{t=1}^T \tilde g_t^2\right)\ln(|u|T+1)}\right]\right).$$

The proof of Theorem 1 can be found in Appendix B.1 and follows from computing the Fenchel conjugate of the potential function, which also yields the explicit constants. For noisy subgradients this is the first bound that is adaptive to the sum of the squares of the noisy subgradients.
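For the conjugate prior the expectation in (5) is a Gaussian-type integral, $w_t = \frac{1}{Z}\int_{-C}^{C} v\,e^{L_t v - B_t v^2}\,dv$ with $Z = \sqrt{\pi/b}\,\operatorname{erf}(C\sqrt{b})$, which admits a closed form via completing the square. The sketch below is our own derivation (equivalent to (7) up to algebraic rearrangement) together with a trapezoid-rule integration used to cross-check it:

```python
import math

def w_conjugate(L, B, C, b):
    """Closed form of w = (1/Z) * integral_{-C}^{C} v*exp(L*v - B*v**2) dv,
    Z = sqrt(pi/b)*erf(C*sqrt(b)), obtained by completing the square."""
    erf_diff = (math.erf((L + 2.0 * B * C) / (2.0 * math.sqrt(B)))
                - math.erf((L - 2.0 * B * C) / (2.0 * math.sqrt(B))))
    num = (math.exp(L * L / (4.0 * B)) * math.sqrt(math.pi) * L * erf_diff
           / (4.0 * B ** 1.5)
           + math.exp(-B * C * C) * (math.exp(-C * L) - math.exp(C * L))
           / (2.0 * B))
    return num * math.sqrt(b / math.pi) / math.erf(C * math.sqrt(b))

def w_numeric(L, B, C, b, n=20001):
    """The same quantity by trapezoid-rule integration, for cross-checking."""
    h = 2.0 * C / (n - 1)
    num = den = 0.0
    for i in range(n):
        v = -C + i * h
        wgt = 0.5 if i in (0, n - 1) else 1.0
        num += wgt * v * math.exp(L * v - B * v * v)
        den += wgt * math.exp(-b * v * v)
    return num / den  # the step size h cancels in the ratio

# Example: G = 1 (so C = 0.2), b = 1, noisy subgradients [-1, -0.5, -1]
# give L = 2.5 and B = 1 + 2.25 = 3.25.
w = w_conjugate(2.5, 3.25, 0.2, 1.0)
```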
Compared to the expected regret bound for the improper prior (see Theorem 3 in Appendix C) this bound has worse constants. However, with the conjugate prior all non-constant terms scale with |u|, which is not the case with the improper prior. For all proper priors of the form (6) a similar regret bound can be computed. This can be seen from Lemma 8 in Appendix B.1, which shows that the convex conjugate of the potential function for these priors is $O(\mathbb{E}[|u|\sqrt{\sum_{t=1}^T \tilde g_t^2\,\ln(|u|T+1)}])$.

3.1 Black-Box Reductions

In this section we use our potential function in a black-box reduction: we take a constrained noisy OCO algorithm A_Z and turn it into an unconstrained algorithm using our potential function. The same reduction is used by Cutkosky and Orabona (2018) and Jun and Orabona (2019); the procedure is summarized in Algorithm 2. The potential function and the OCO algorithm each have their own task: the potential function learns the norm of u and the constrained OCO algorithm learns the direction of u. In each round t we play w_t = v_t z_t, where z_t ∈ Z, with Z = {z : ‖z‖ ≤ 1}, is the prediction of the OCO algorithm and v_t is the prediction of Algorithm 1. We feed g̃_t as feedback to A_Z and ⟨z_t, g̃_t⟩ as feedback to Algorithm 1.

Algorithm 2 Black-Box Reduction
Require: G such that ‖E[g̃_t]‖_⋆ ≤ G and an algorithm A_Z with domain Z = {z : ‖z‖ ≤ 1}
1: for t = 1, ..., T do
2:   Get z_t ∈ Z from A_Z
3:   Get v_t ∈ R from Algorithm 1
4:   Play w_t = v_t z_t, receive symmetrical g̃_t such that ‖E[g̃_t]‖_⋆ ≤ G
5:   Send g̃_t to A_Z
6:   Send ⟨z_t, g̃_t⟩ to Algorithm 1
7: end for

Since g̃_t is a symmetrical random variable and |E[⟨z_t, g̃_t⟩]| ≤ G, ⟨z_t, g̃_t⟩ satisfies the assumptions in Lemma 4. This allows us to control the regret for learning the norm of u using Theorem 1. As outlined by Cutkosky and Orabona (2018), the expected regret of Algorithm 2 decomposes into two parts. The first part of the regret is for learning the norm of u and is controlled by Algorithm 1. The second part of the regret is for learning the direction of u and is controlled by A_Z. The proof is given by Cutkosky and Orabona (2018), but for completeness we provide it in Appendix B.2.

Lemma 6. Suppose g̃_t is a symmetrical random variable with ‖E[g̃_t]‖_⋆ ≤ G for all t. Let $R_T^V(\|u\|) = \mathbb{E}[\sum_{t=1}^T (v_t - \|u\|)\langle z_t, \tilde g_t\rangle]$ be the regret for learning ‖u‖, incurred by Algorithm 1, and let $R_T^Z(\frac{u}{\|u\|}) = \mathbb{E}[\sum_{t=1}^T \langle z_t - \frac{u}{\|u\|}, \tilde g_t\rangle]$ be the regret for learning $\frac{u}{\|u\|}$, incurred by A_Z. Then Algorithm 2 satisfies

$$\mathbb{E}[R_T(u)] = R_T^V(\|u\|) + \|u\|\,R_T^Z\!\left(\tfrac{u}{\|u\|}\right).$$

Orabona and Pál (2018) show that Mirror Descent with learning rates $\eta_t = (\sqrt{\sum_{s=1}^t \|\tilde g_s\|_\star^2})^{-1}$ yields $R_T^Z(\frac{u}{\|u\|}) = O(\mathbb{E}[\sqrt{\sum_{t=1}^T \|\tilde g_t\|_\star^2}])$. Since Algorithm 1 satisfies $R_T^V(\|u\|) = O(\mathbb{E}[\|u\|\sqrt{\sum_{t=1}^T \|\tilde g_t\|_\star^2\,\ln(\|u\|\sum_{t=1}^T \|\tilde g_t\|_\star^2 + 1)}])$, the total regret of Algorithm 2 is

$$\mathbb{E}[R_T(u)] = O\left(\|u\|\,\mathbb{E}\left[\sqrt{\sum_{t=1}^T \|\tilde g_t\|_\star^2\,\ln\!\left(\|u\|\sum_{t=1}^T \|\tilde g_t\|_\star^2 + 1\right)}\right]\right). \qquad (8)$$

This bound matches state-of-the-art bounds for
noiseless subgradients and is never worse than the bound of Jun and Orabona (2019) for noisy subgradients, but can be substantially better.

3.2 Private Unconstrained Adaptive Sparse Gradient Descent

Algorithm 3 Private Unconstrained Adaptive Sparse Gradient Descent
Require: G such that |E[g̃_{t,j}]| ≤ G for all j, and a copy of Algorithm 1 (with the conjugate prior) per dimension
1: for t = 1, ..., T do
2:   Play w_t
3:   for j = 1, ..., d do
4:     Receive symmetrical g̃_{t,j} such that |E[g̃_{t,j}]| ≤ G
5:     Send g̃_{t,j} to the jth copy of Algorithm 1
6:     Receive v_{t+1} ∈ R from the jth copy of Algorithm 1
7:     Set w_{t+1,j} = v_{t+1}
8:   end for
9: end for

In this section we propose a noisy unconstrained OCO algorithm that can exploit sparse subgradients. The algorithm is summarized in Algorithm 3, which runs a copy of Algorithm 1 with the conjugate prior coordinate-wise. A similar strategy is used by Orabona and Tommasi (2017). This strategy can exploit sparse privacy structures, which, combined with sparse subgradients, may yield low regret (see section 4). Its expected regret bound is given below; the proof follows from applying Theorem 1 per dimension.

Theorem 2. Suppose g̃_{t,j} is a symmetric random variable with |E[g̃_{t,j}]| ≤ G for all t and j.
Then the expected regret of Algorithm 3 satisfies

$$\mathbb{E}[R_T(u)] = O\left(d + \sum_{j=1}^d\left(|u_j|G\ln(|u_j|G+1) + \mathbb{E}\left[|u_j|\sqrt{\left(b_j+\sum_{t=1}^T \tilde g_{t,j}^2\right)\ln(|u_j|T+1)}\right]\right)\right).$$

4 Motivating Examples

In this section we present two scenarios in which our algorithms provide better expected regret guarantees than standard algorithms. The first scenario concerns a case where many providers do not care for their privacy (so they do not perturb the subgradients) and few providers care substantially for their privacy. Suppose that the providers who care for their privacy are ⌈ln(T)⌉ of the total number of providers T. Suppose that ‖g_t‖₂² ≤ 1 and that the providers who care for their privacy use ρ(z) ∝ exp(−(ε/2)‖z‖₂); then E[‖ξ_t‖₂²] ≤ 4 + 4(d² + d)/ε² (Song et al., 2015, Theorem 1). Using Algorithm 2, Jensen's inequality, and the fact that the square root is subadditive, we see from (8) that the expected regret is upper bounded by

$$O\left(\|u\|_2\sqrt{\sum_{t=1}^T \|g_t\|_2^2\,\ln(1+\|u\|_2 T)} + \|u\|_2\,\frac{d}{\epsilon}\sqrt{\ln(T)\ln(\|u\|_2 T + T)}\right)$$

instead of $O(\|u\|_2\frac{d}{\epsilon}\sqrt{T\ln(1+\|u\|_2 T)})$ had we used the maximum privacy guarantee for all providers instead of letting the providers choose their desired level of privacy.

In the second scenario the providers use the local Laplace randomiser. Suppose that g_t is sparse.
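The noise bound used in the first scenario above can be sanity-checked by simulation. For the density ρ(z) ∝ exp(−(ε/2)‖z‖₂) the radius ‖z‖₂ follows a Gamma(d, 2/ε) distribution (a standard spherical decomposition; the exact constant 4(d² + d)/ε² below is our derivation, consistent with the quoted bound E[‖ξ_t‖₂²] ≤ 4 + 4(d² + d)/ε²):

```python
import math
import random

def sample_l2_laplace(d, eps, rng=random):
    """Sample from the density proportional to exp(-(eps/2) * ||z||_2):
    radius r ~ Gamma(shape=d, scale=2/eps), direction uniform on the sphere."""
    r = rng.gammavariate(d, 2.0 / eps)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in direction))
    return [r * x / norm for x in direction]

random.seed(0)
d, eps, n = 3, 2.0, 40000
mean_sq = sum(sum(x * x for x in sample_l2_laplace(d, eps))
              for _ in range(n)) / n
# E[||xi||_2^2] = E[r^2] = (2/eps)**2 * d * (d + 1) = 4*(d**2 + d)/eps**2.
```

In the scenario above only ⌈ln T⌉ of the T providers add this noise, which is what turns the (d/ε)√T noise cost into a (d/ε)√(ln T) cost.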
In the second scenario the providers use the local Laplace randomiser. Suppose that $g_t$ is sparse. A standard algorithm that has good performance for sparse $g_t$ is AdaGrad (Duchi et al., 2011). AdaGrad achieves $O(\mathbb{E}[D \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} |\tilde g_{t,j}|^2}])$ expected regret, where $\max_j |u_j| \le D$, and $D$ has to be guessed prior to running AdaGrad. Using Jensen's inequality and the fact that the square root is subadditive, the expected regret can be upper bounded by $O(D \sum_{j=1}^{d} (\sqrt{\sum_{t=1}^{T} \mathbb{E}[g_{t,j}^2]} + \sqrt{3\sum_{t=1}^{T} \mathbb{E}[\xi_{t,j}^2]}))$. Algorithm 3 achieves $O(\sum_{j=1}^{d} |u_j| (\sqrt{\sum_{t=1}^{T} \mathbb{E}[g_{t,j}^2] \ln(|u_j|T + 1)} + \sqrt{3\sum_{t=1}^{T} \mathbb{E}[\xi_{t,j}^2] \ln(|u_j|T + 1)}))$ regret, which can be significantly smaller than the bound of AdaGrad if $D$ is much larger than $u_j$ or if $u$ is sparse. Furthermore, since we allow the provider of the data to choose $\tau_{t,j}$, $\xi_t$ can be sparse as well. While this does not give local differential privacy guarantees for all attributes, it does give local differential privacy guarantees for attributes with $\tau_j < \infty$. Had we instead used the standard application of local differential privacy, there would be no hope of exploiting sparse $g_t$, since $\sum_{j=1}^{d} |u_j| \sqrt{3\sum_{t=1}^{T} \mathbb{E}[\xi_{t,j}^2]}$ would be the dominant term in the regret bound.

5 Conclusions

In this paper, we extended the local differential privacy framework in unconstrained Online Convex Optimization by allowing the providers of the data to choose their own privacy guarantees. Standard algorithms do not yield satisfactory regret bounds in this setting, either due to dependence on the unknown parameters of the noise or due to dependence on bounded subgradients.
Hence, we proposed two new algorithms that match state-of-the-art regret bounds in both the noisy and the noiseless setting, without requiring knowledge of the noise other than symmetry. Our algorithms do not require parameters other than a bound on the norm of the expectation of the subgradients, which allows the privacy requirements of all providers to remain private themselves. The new algorithms are a step towards practically useful algorithms with local differential privacy guarantees and sound theoretical guarantees. Our algorithms are the first adaptive unconstrained algorithms in the noisy OCO setting that do not require extra parameters compared to the standard OCO setting, solving two problems left open by Jun and Orabona (2019).

Acknowledgments

The author would like to thank Tim van Erven for his comments on an earlier version of this paper. The author was supported by the Netherlands Organization for Scientific Research (NWO grant TOP2EW.15.211).

References

Abernethy, J., C. Lee, A. McMillan, and A. Tewari
2017. Online learning via differential privacy. arXiv preprint arXiv:1711.10019.

Agarwal, N. and K. Singh
2017. The price of differential privacy for online learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), Pp. 32–40.

Cesa-Bianchi, N. and G. Lugosi
2006. Prediction, Learning, and Games. Cambridge University Press.

Chernov, A. and V. Vovk
2010. Prediction with advice of unknown number of experts. Uncertainty in Artificial Intelligence, Pp. 117–125.

Cutkosky, A. and K. Boahen
2017. Online learning without prior information. In Proceedings of the 30th Annual Conference on Learning Theory (COLT), Pp. 643–677.

Cutkosky, A. and F. Orabona
2018. Black-box reductions for parameter-free online learning in Banach spaces. In Proceedings of the 31st Annual Conference on Learning Theory (COLT), Pp. 1493–1529.

Duchi, J., E.
Hazan, and Y. Singer
2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Duchi, J. C., M. I. Jordan, and M. J. Wainwright
2014. Privacy aware learning. Journal of the ACM (JACM), 61(6):38.

Dwork, C. and A. Roth
2014. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407.

Foster, D. J., S. Kale, M. Mohri, and K. Sridharan
2017. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, Pp. 6020–6030.

Foster, D. J., A. Rakhlin, and K. Sridharan
2015. Adaptive online learning. In Advances in Neural Information Processing Systems, Pp. 3375–3383.

Foster, D. J., A. Rakhlin, and K. Sridharan
2018. Online learning: Sufficient statistics and the Burkholder method. In Proceedings of the 31st Annual Conference on Learning Theory (COLT), Pp. 3028–3064.

Hazan, E.
2016. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325.

Hiriart-Urruty, J.-B.
2006. A note on the Legendre-Fenchel transform of convex composite functions. In Nonsmooth Mechanics and Analysis, Pp. 35–46. Springer.

Jain, P., P. Kothari, and A. Thakurta
2012. Differentially private online learning. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), Pp. 24–1.

Jain, P. and A. G. Thakurta
2014. (near) dimension independent risk bounds for differentially private learning. In Proceedings of the 31st International Conference on Machine Learning (ICML), Pp. 476–484.

Jun, K.-S. and F. Orabona
2019. Parameter-free online convex optimization with sub-exponential noise. Proceedings of the 32nd Annual Conference on Learning Theory (COLT).

Kasiviswanathan, S. P., H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith
2011.
What can we learn privately? SIAM Journal on Computing, 40(3):793–826.

Koolen, W. M. and T. van Erven
2015. Second-order quantile methods for experts and combinatorial games. In Proceedings of The 28th Conference on Learning Theory (COLT), Pp. 1155–1175.

Kotłowski, W.
2017. Scale-invariant unconstrained online learning. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), Pp. 412–433.

McMahan, B. and M. Streeter
2012. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, Pp. 2402–2410.

McMahan, H. B. and F. Orabona
2014. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of the 27th Annual Conference on Learning Theory (COLT), Pp. 1020–1039.

Orabona, F. and D. Pál
2016. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems, Pp. 577–585.

Orabona, F. and D. Pál
2018. Scale-free online learning. Theoretical Computer Science, 716:50–69.

Orabona, F. and T. Tommasi
2017. Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems, Pp. 2160–2170.

Song, S., K. Chaudhuri, and A. Sarwate
2015. Learning from data with heterogeneous noise using SGD. In Artificial Intelligence and Statistics, Pp. 894–902.

Thakurta, A. G. and A. Smith
2013. (nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, Pp. 2733–2741.

Wasserman, L. and S. Zhou
2010. A statistical framework for differential privacy.
Journal of the American Statistical Association, 105(489):375–389.