{"title": "Differentially Private Empirical Risk Minimization Revisited: Faster and More General", "book": "Advances in Neural Information Processing Systems", "page_first": 2722, "page_last": 2731, "abstract": "In this paper we study differentially private Empirical Risk Minimization(ERM) in different settings. For smooth (strongly) convex loss function with or without (non)-smooth regularization, we give algorithms which achieve either optimal or near optimal utility bound with less gradient complexity compared with previous work.  For ERM with smooth convex loss function in high-dimension($p\\gg n$) setting, we give an algorithm which achieves the upper bound with less gradient complexity than previous ones. At last, we generalize the expected excess empirical risk from convex to Polyak-Lojasiewicz condition and give a tighter upper bound of the utility comparing with the result in \\cite{DBLP:journals/corr/ZhangZMW17}.", "full_text": "Differentially Private Empirical Risk Minimization\n\nRevisited: Faster and More General\u2217\n\nDi Wang\n\nMinwei Ye\n\nDept. of Computer Science and Engineering\n\nState University of New York at Buffalo\n\nDept. of Computer Science and Engineering\n\nState University of New York at Buffalo\n\nBuffalo, NY 14260\n\ndwang45@buffalo.edu\n\nBuffalo, NY 14260\n\nminweiye@buffalo.edu\n\nJinhui Xu\n\nDept. of Computer Science and Engineering\n\nState University of New York at Buffalo\n\nBuffalo, NY 14260\n\njinhui@buffalo.edu\n\nAbstract\n\nIn this paper we study the differentially private Empirical Risk Minimization\n(ERM) problem in different settings. For smooth (strongly) convex loss function\nwith or without (non)-smooth regularization, we give algorithms that achieve either\noptimal or near optimal utility bounds with less gradient complexity compared with\nprevious work. 
For ERM with smooth convex loss functions in the high-dimensional\n($p \\gg n$) setting, we give an algorithm that achieves the upper bound with less\ngradient complexity than previous ones. Finally, we generalize the expected excess\nempirical risk from convex loss functions to non-convex ones satisfying the Polyak-Lojasiewicz\ncondition and give a tighter upper bound on the utility than the one in\n[34].\n\n1 Introduction\n\nPrivacy preservation is an important issue in learning. Nowadays, learning algorithms are often required\nto deal with sensitive data. This means that the algorithm needs to not only learn effectively from\nthe data but also provide a certain level of privacy guarantee. Differential privacy is a\nrigorous notion of statistical data privacy and has received a great deal of attention in recent years\n[11, 10]. As a commonly used supervised learning method, Empirical Risk Minimization (ERM)\nalso faces the challenge of preserving privacy while learning. Differentially\nPrivate (DP) ERM with convex loss functions has been extensively studied in the last decade, starting\nfrom [7]. In this paper, we revisit this problem and present several improved results.\nProblem Setting Given a dataset $D = \\{z_1, z_2, \\cdots, z_n\\}$ from a data universe $\\mathcal{X}$ and a closed\nconvex set $\\mathcal{C} \\subseteq \\mathbb{R}^p$, DP-ERM is to find\n\n$$x^* \\in \\arg\\min_{x \\in \\mathcal{C}} F^r(x, D) = F(x, D) + r(x) = \\frac{1}{n} \\sum_{i=1}^{n} f(x, z_i) + r(x)$$\n\nwith the guarantee of being differentially private. We refer to f as the loss function, and $r(\\cdot)$ is a simple\n(non)-smooth convex function called the regularizer. 
If the loss function is convex, the utility of the\nalgorithm is measured by the expected excess empirical risk, i.e., $\\mathbb{E}[F^r(x_{\\text{private}}, D)] - F^r(x^*, D)$. The\nexpectation is over the coins of the algorithm.\n\n\u2217This research was supported in part by NSF through grants IIS-1422591, CCF-1422324, and CCF-1716400.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nReference | Method | Utility Upper Bd | Gradient Complexity | Non smooth Regularizer?\n[8][7] | Objective Perturbation | $O(\\frac{p}{n^2\\epsilon^2})$ | N/A | No\n[21] | Objective Perturbation | $O(\\frac{p}{n^2\\epsilon^2} + \\frac{\\lambda\\|x^*\\|^2}{n\\epsilon})$ | N/A | Yes\n[6] | Gradient Perturbation | $O(\\frac{p \\log^2(n)}{n^2\\epsilon^2})$ | $O(n^2)$ | Yes\n[34] | Output Perturbation | $O(\\frac{p}{n^2\\epsilon^2})$ | $O(n\\kappa \\log(\\frac{n\\epsilon}{\\kappa}))$ | No\nThis Paper | Gradient Perturbation | $O(\\frac{p \\log(n)}{n^2\\epsilon^2})$ | $O((n + \\kappa) \\log(\\frac{n\\epsilon\\mu}{p}))$ | Yes\n\nTable 1: Comparison with previous $(\\epsilon, \\delta)$-DP algorithms. We assume that the loss function f is\nconvex, 1-smooth, differentiable (twice differentiable for objective perturbation), and 1-Lipschitz. $F^r$\nis $\\mu$-strongly convex. Bound and complexity ignore multiplicative dependence on $\\log(1/\\delta)$. $\\kappa = L/\\mu$\nis the condition number. The lower bound is $\\Omega(\\min\\{1, \\frac{p}{n^2\\epsilon^2}\\})$ [6].\n\nA number of approaches exist for this problem with convex loss functions, which can be roughly\nclassified into three categories. The first type of approach perturbs the output of a non-DP\nalgorithm. [7] first proposed the output perturbation approach, which was extended by [34]. The second\ntype of approach perturbs the objective function [7]. We refer to it as the objective perturbation\napproach. 
The third type of approaches is to perturb gradients in \ufb01rst order optimization algorithms.\n[6] proposed gradient perturbation approach and gave the lower bound of the utility for both general\nconvex and strongly convex loss functions. Later, [28] showed that this bound can actually be broken\nby adding more restrictions on the convex domain C of the problem.\nAs shown in the following tables2 , the output perturbation approach can achieve the optimal bound of\nutility for strongly convex case. But it cannot be generalized to the case with non-smooth regularizer.\nThe objective perturbation approach needs to obtain the optimal solution to ensure both differential\nprivacy and utility, which is often intractable in practice, and cannot achieve the optimal bound. The\ngradient perturbation approach can overcome all the issues and thus is preferred in practice. However,\nits existing results are all based on Gradient Descent (GD) or Stochastic Gradient Descent (SGD).\nFor large datasets, they are slow in general. In the \ufb01rst part of this paper, we present algorithms\nwith tighter utility upper bound and less running time. Almost all the aforementioned results did\nnot consider the case where the loss function is non-convex. Recently, [34] studied this case and\nmeasured the utility by gradient norm. In the second part of this paper, we generalize the expected\nexcess empirical risk from convex to Polyak-Lojasiewicz condition, and give a tighter upper bound of\nthe utility given in [34]. Due to space limit, we leave many details, proofs, and experimental studies\nin the supplement.\n\n2 Related Work\n\nThere is a long list of works on differentially private ERM in the last decade which attack the problem\nfrom different perspectives. [17][30] and [2] investigated regret bound in online settings. [20] studied\nregression in incremental settings. [32] and [31] explored the problem from the perspective of\nlearnability and stability. 
We will compare to the works that are most related to ours from the utility\nand gradient complexity (i.e., the number of calls to the first-order oracle $(f(x, z_i), \\nabla f(x, z_i))$)\npoints of view. Table 1 gives the comparison for the case where the loss function is strongly\nconvex and 1-smooth. Our algorithm achieves a near optimal bound with less gradient complexity\nthan previous ones. It is also robust to non-smooth regularizers.\nTables 2 and 3 show that for the non-strongly convex and high-dimensional cases, our algorithms outperform\nother peer methods. In particular, we improve the gradient complexity from $O(n^2)$ to $O(n \\log n)$\nwhile preserving the optimal bound for the non-strongly convex case. For the high-dimensional case, the gradient\ncomplexity is reduced from $O(n^3)$ to $O(n^{1.5})$.\n\n2 Bound and complexity ignore multiplicative dependence on $\\log(1/\\delta)$.\n\nReference | Method | Utility Upper Bd | Gradient Complexity | Non smooth Regularizer?\n[21] | Objective Perturbation | $O(\\frac{\\sqrt{p}}{n\\epsilon})$ | N/A | Yes\n[6] | Gradient Perturbation | $O(\\frac{\\sqrt{p} \\log^{3/2}(n)}{n\\epsilon})$ | $O(n^2)$ | Yes\n[34] | Output Perturbation | $O([\\frac{\\sqrt{p}}{n\\epsilon}]^{2/3})$ | $O(n[\\frac{n\\epsilon}{d}]^{2/3})$ | No\nThis paper | Gradient Perturbation | $O(\\frac{\\sqrt{p}}{n\\epsilon})$ | $O(\\frac{n\\epsilon}{\\sqrt{p}} + n \\log(\\frac{n\\epsilon}{p}))$ | Yes\n\nTable 2: Comparison with previous $(\\epsilon, \\delta)$-DP algorithms, where $F^r$ is not necessarily strongly convex.\nWe assume that the loss function f is convex, 1-smooth, differentiable (twice differentiable for\nobjective perturbation), and 1-Lipschitz. Bound and complexity ignore multiplicative dependence on\n$\\log(1/\\delta)$. The lower bound in this case is $\\Omega(\\min\\{1, \\frac{\\sqrt{p}}{n\\epsilon}\\})$ [6].\n\nNote that [19] also considered the high-dimensional case\nvia dimension reduction. 
But their method requires the optimal value in the dimension-reduced space;\nin addition, they considered loss functions under the condition rather than $\\ell_2$-norm Lipschitz.\nFor non-convex problems under differential privacy, [15][9][13] studied private SVD. [14] investigated\nk-median clustering. [34] studied ERM with non-convex smooth loss functions. In [34], the authors\ndefined the utility using the gradient norm as $\\mathbb{E}[\\|\\nabla F(x_{\\text{private}})\\|^2]$. They achieved a qualified utility in\n$O(n^2)$ gradient complexity via DP-SGD. In this paper, we use DP-GD and show that it has a tighter\nutility upper bound.\n\nReference | Method | Utility Upper Bd | Gradient Complexity | Non smooth Regularizer?\n[28] | Gradient Perturbation | $O(\\frac{\\sqrt{G_C^2 + \\|C\\|_2^2} \\log(n)}{n\\epsilon})$ | $O(\\frac{n^3\\epsilon^2}{(G_C^2 + \\|C\\|_2^2) \\log^2(n)})$ | Yes\n[28] | Objective Perturbation | $O(\\frac{G_C + \\lambda\\|C\\|_2^2}{n\\epsilon})$ | N/A | No\n[29] | Gradient Perturbation | $O(\\frac{G_C^{2/3} \\log^2(n)}{(n\\epsilon)^{2/3}})$ | $O(\\frac{(n\\epsilon)^{2/3}}{G_C^{2/3}})$ | Yes\nThis paper | Gradient Perturbation | $O(\\frac{\\sqrt{G_C^2 + \\|C\\|_2^2}}{n\\epsilon})$ | $O(\\frac{n^{1.5}\\sqrt{\\epsilon}}{(G_C^2 + \\|C\\|_2^2)^{1/4}})$ | No\n\nTable 3: Comparison with previous $(\\epsilon, \\delta)$-DP algorithms. We assume that the loss function f is\nconvex, 1-smooth, differentiable (twice differentiable for objective perturbation), and 1-Lipschitz.\nThe utility bound depends on $G_C$, which is the Gaussian width of C. Bound and complexity ignore\nmultiplicative dependence on $\\log(1/\\delta)$.\n\n3 Preliminaries\nNotations: We let [n] denote $\\{1, 2, \\ldots, n\\}$. Vectors are in column form. For a vector v, we use\n$\\|v\\|_2$ to denote its $\\ell_2$-norm. For the gradient complexity notation, G, $\\delta$, $\\epsilon$ are omitted unless specified.\n$D = \\{z_1, \\cdots, z_n\\}$ is a dataset of n individuals.\nDefinition 3.1 (Lipschitz Function over $\\theta$). 
A loss function $f : \\mathcal{C} \\times \\mathcal{X} \\to \\mathbb{R}$ is G-Lipschitz (under the\n$\\ell_2$-norm) over $\\theta$ if for any $z \\in \\mathcal{X}$ and $\\theta_1, \\theta_2 \\in \\mathcal{C}$, we have $|f(\\theta_1, z) - f(\\theta_2, z)| \\leq G\\|\\theta_1 - \\theta_2\\|_2$.\nDefinition 3.2 (L-smooth Function over $\\theta$). A loss function $f : \\mathcal{C} \\times \\mathcal{X} \\to \\mathbb{R}$ is L-smooth over $\\theta$ with\nrespect to the norm $\\|\\cdot\\|$ if for any $z \\in \\mathcal{X}$ and $\\theta_1, \\theta_2 \\in \\mathcal{C}$, we have\n\n$$\\|\\nabla f(\\theta_1, z) - \\nabla f(\\theta_2, z)\\|_* \\leq L\\|\\theta_1 - \\theta_2\\|,$$\n\nwhere $\\|\\cdot\\|_*$ is the dual norm of $\\|\\cdot\\|$. If f is differentiable, this yields\n\n$$f(\\theta_1, z) \\leq f(\\theta_2, z) + \\langle \\nabla f(\\theta_2, z), \\theta_1 - \\theta_2 \\rangle + \\frac{L}{2}\\|\\theta_1 - \\theta_2\\|^2.$$\n\nWe say that two datasets D, D' are neighbors if they differ by only one entry, denoted as $D \\sim D'$.\nDefinition 3.3 (Differentially Private [11]). A randomized algorithm $\\mathcal{A}$ is $(\\epsilon, \\delta)$-differentially private\nif for all neighboring datasets D, D' and for all events S in the output space of $\\mathcal{A}$, we have\n\n$$\\Pr(\\mathcal{A}(D) \\in S) \\leq e^{\\epsilon} \\Pr(\\mathcal{A}(D') \\in S) + \\delta.$$\n\nWhen $\\delta = 0$, $\\mathcal{A}$ is $\\epsilon$-differentially private.\n\nWe will use the Gaussian Mechanism [11] and the moments accountant [1] to guarantee $(\\epsilon, \\delta)$-DP.\nDefinition 3.4 (Gaussian Mechanism). Given any function $q : \\mathcal{X}^n \\to \\mathbb{R}^p$, the Gaussian Mechanism\nis defined as\n\n$$\\mathcal{M}_G(D, q, \\epsilon) = q(D) + Y,$$\n\nwhere Y is drawn from the Gaussian distribution $\\mathcal{N}(0, \\sigma^2 I_p)$ with $\\sigma \\geq \\frac{\\sqrt{2\\ln(1.25/\\delta)}\\,\\Delta_2(q)}{\\epsilon}$. Here $\\Delta_2(q)$\nis the $\\ell_2$-sensitivity of the function q, i.e., $\\Delta_2(q) = \\sup_{D \\sim D'} \\|q(D) - q(D')\\|_2$. The Gaussian Mechanism\npreserves $(\\epsilon, \\delta)$-differential privacy.\n\nThe moments accountant proposed in [1] is a method to accumulate the privacy cost which has a tighter\nbound for $\\epsilon$ and $\\delta$. 
Roughly speaking, when we use the Gaussian Mechanism on (stochastic)\ngradient descent, we can save a factor of $\\sqrt{\\ln(T/\\delta)}$ in the asymptotic bound of the standard deviation of the\nnoise compared with the advanced composition theorem in [12].\nTheorem 3.1 ([1]). For a G-Lipschitz loss function, there exist constants $c_1$ and $c_2$ so that given the\nsampling probability $q = l/n$ and the number of steps T, for any $\\epsilon < c_1 q^2 T$, a DP stochastic gradient\nalgorithm with batch size l that injects Gaussian noise with standard deviation $\\frac{G}{n}\\sigma$ to the gradients\n(Algorithm 1 in [1]) is $(\\epsilon, \\delta)$-differentially private for any $\\delta > 0$ if\n\n$$\\sigma \\geq c_2 \\frac{q\\sqrt{T \\ln(1/\\delta)}}{\\epsilon}.$$\n\n4 Differentially Private ERM with Convex Loss Function\n\nIn this section we consider ERM with a (non)-smooth regularizer (see footnote 3), i.e.,\n\n$$\\min_{x \\in \\mathbb{R}^p} F^r(x, D) = F(x, D) + r(x) = \\frac{1}{n} \\sum_{i=1}^{n} f(x, z_i) + r(x). \\quad (1)$$\n\nThe loss function f is convex for every z. We define the proximal operator as\n\n$$\\text{prox}_r(y) = \\arg\\min_{x \\in \\mathbb{R}^p} \\{\\frac{1}{2}\\|x - y\\|_2^2 + r(x)\\},$$\n\nand denote $x^* = \\arg\\min_{x \\in \\mathbb{R}^p} F^r(x, D)$.\n\n3 All of the algorithms and theorems in this section are applicable to a closed convex set C rather than $\\mathbb{R}^p$.\n\nAlgorithm 1 DP-SVRG($F^r$, $\\tilde{x}_0$, T, m, $\\eta$, $\\sigma$)\nInput: f(x, z) is G-Lipschitz and L-smooth. $F^r(x, D)$ is $\\mu$-strongly convex w.r.t. the $\\ell_2$-norm. $\\tilde{x}_0$ is the\ninitial point, $\\eta$ is the step size, and T, m are the iteration numbers.\nfor s = 1, 2, ..., T do\n    $\\tilde{x} = \\tilde{x}_{s-1}$\n    $\\tilde{v} = \\nabla F(\\tilde{x})$\n    $x_0^s = \\tilde{x}$\n    for t = 1, 2, ..., m do\n        Pick $i_t^s \\in [n]$\n        $v_t^s = \\nabla f(x_{t-1}^s, z_{i_t^s}) - \\nabla f(\\tilde{x}, z_{i_t^s}) + \\tilde{v} + u_t^s$, where $u_t^s \\sim \\mathcal{N}(0, \\sigma^2 I_p)$\n        $x_t^s = \\text{prox}_{\\eta r}(x_{t-1}^s - \\eta v_t^s)$\n    end for\n    $\\tilde{x}_s = \\frac{1}{m} \\sum_{k=1}^{m} x_k^s$\nend for\nreturn $\\tilde{x}_T$\n\n4.1 Strongly convex case\n\nWe first consider the case where $F^r(x, D)$ is $\\mu$-strongly convex. Algorithm 1 is based on\nProx-SVRG [33], which is much faster than SGD or GD. We will show that DP-SVRG is also faster than\nDP-SGD or DP-GD in terms of the time needed to achieve the near optimal excess empirical risk\nbound.\nDefinition 4.1 (Strongly Convex). The function f(x) is $\\mu$-strongly convex with respect to the norm $\\|\\cdot\\|$\nif there exists $\\mu > 0$ such that for any $x, y \\in \\text{dom}(f)$,\n\n$$f(y) \\geq f(x) + \\langle \\partial f, y - x \\rangle + \\frac{\\mu}{2}\\|y - x\\|^2, \\quad (2)$$\n\nwhere $\\partial f$ is any subgradient of f at x.\nTheorem 4.1. In DP-SVRG (Algorithm 1), for $\\epsilon \\leq c_1 \\frac{Tm}{n^2}$ with some constant $c_1$ and $\\delta > 0$, it is\n$(\\epsilon, \\delta)$-differentially private if\n\n$$\\sigma^2 = c \\frac{G^2 T m \\ln(\\frac{1}{\\delta})}{n^2\\epsilon^2} \\quad (3)$$\n\nfor some constant c.\nRemark 4.1. The constraint on $\\epsilon$ in Theorems 4.1 and 4.3 comes from Theorem 3.1. This constraint\ncan be removed if the noise $\\sigma$ is amplified by a factor of $O(\\ln(T/\\delta))$ in (3) and (6). But accordingly\nthere will be a factor of $\\tilde{O}(\\log(Tm/\\delta))$ in the utility bound in (5) and (7). In this case the guarantee\nof differential privacy follows from the advanced composition theorem and privacy amplification via sampling [6].\nTheorem 4.2 (Utility guarantee). Suppose that the loss function f(x, z) is convex, G-Lipschitz and\nL-smooth over x. $F^r(x, D)$ is $\\mu$-strongly convex w.r.t. the $\\ell_2$-norm. In DP-SVRG (Algorithm 1), let $\\sigma$\nbe as in (3). If one chooses $\\eta = \\Theta(\\frac{1}{L}) \\leq \\frac{1}{12L}$ and sufficiently large $m = \\Theta(\\frac{L}{\\mu})$ so that they satisfy the\ninequality\n\n$$\\frac{1}{\\eta(1 - 8\\eta L)\\mu m} + \\frac{8L\\eta(m + 1)}{m(1 - 8L\\eta)} < \\frac{1}{2}, \\quad (4)$$\n\nthen the following holds for $T = O(\\log(\\frac{n^2\\epsilon^2\\mu}{pG^2 \\ln(1/\\delta)}))$,\n\n$$\\mathbb{E}[F^r(\\tilde{x}_T, D)] - F^r(x^*, D) \\leq \\tilde{O}(\\frac{p \\log(n) G^2 \\log(1/\\delta)}{n^2\\epsilon^2\\mu}), \\quad (5)$$\n\nwhere some insignificant logarithmic terms are hidden in the $\\tilde{O}$-notation. The total gradient complexity\nis $O((n + \\frac{L}{\\mu}) \\log(\\frac{n\\epsilon\\mu}{p}))$.\nRemark 4.2. We can further use some acceleration methods to reduce the gradient complexity, see\n[25][3].\n\n4.2 Non-strongly convex case\n\nIn some cases, $F^r(x, D)$ may not be strongly convex. For such cases, [5] has recently shown that\nSVRG++ has less gradient complexity than Accelerated Gradient Descent. Following the idea of\nDP-SVRG, we present the algorithm DP-SVRG++ for the non-strongly convex case. Unlike the\nprevious one, this algorithm can achieve the optimal utility bound.\nTheorem 4.3. In DP-SVRG++ (Algorithm 2), for $\\epsilon \\leq c_1 \\frac{2^T m}{n^2}$ with some constant $c_1$ and $\\delta > 0$, it is\n$(\\epsilon, \\delta)$-differentially private if\n\n$$\\sigma^2 = c \\frac{G^2 2^T m \\ln(\\frac{2}{\\delta})}{n^2\\epsilon^2} \\quad (6)$$\n\nfor some constant c.\nTheorem 4.4 (Utility guarantee). Suppose that the loss function f(x, z) is convex, G-Lipschitz and\nL-smooth. 
In DP-SVRG++ (Algorithm 2), if $\\sigma$ is chosen as in (6), $\\eta = \\frac{1}{13L}$, and $m = \\Theta(L)$ is\nsufficiently large, then the following holds for $T = O(\\log(\\frac{n\\epsilon}{G\\sqrt{p}\\sqrt{\\log(1/\\delta)}}))$,\n\n$$\\mathbb{E}[F^r(\\tilde{x}_T, D)] - F^r(x^*, D) \\leq O(\\frac{G\\sqrt{p \\ln(1/\\delta)}}{n\\epsilon}). \\quad (7)$$\n\nThe gradient complexity is $O(\\frac{nL\\epsilon}{\\sqrt{p}} + n \\log(\\frac{n\\epsilon}{p}))$.\n\nAlgorithm 2 DP-SVRG++($F^r$, $\\tilde{x}_0$, T, m, $\\eta$, $\\sigma$)\nInput: f(x, z) is G-Lipschitz and L-smooth over $x \\in \\mathcal{C}$. $\\tilde{x}_0$ is the initial point, $\\eta$ is the step size, and\nT, m are the iteration numbers.\n$x_0^1 = \\tilde{x}_0$\nfor s = 1, 2, ..., T do\n    $\\tilde{v} = \\nabla F(\\tilde{x}_{s-1})$\n    $m_s = 2^s m$\n    for t = 1, 2, ..., $m_s$ do\n        Pick $i_t^s \\in [n]$\n        $v_t^s = \\nabla f(x_{t-1}^s, z_{i_t^s}) - \\nabla f(\\tilde{x}_{s-1}, z_{i_t^s}) + \\tilde{v} + u_t^s$, where $u_t^s \\sim \\mathcal{N}(0, \\sigma^2 I_p)$\n        $x_t^s = \\text{prox}_{\\eta r}(x_{t-1}^s - \\eta v_t^s)$\n    end for\n    $\\tilde{x}_s = \\frac{1}{m_s} \\sum_{k=1}^{m_s} x_k^s$\n    $x_0^{s+1} = x_{m_s}^s$\nend for\nreturn $\\tilde{x}_T$\n\n5 Differentially Private ERM for Convex Loss Function in High Dimensions\n\nThe utility bounds and gradient complexities in Section 4 depend on the dimensionality p. In the high-dimensional\n(i.e., $p \\gg n$) case, such a dependence is not very desirable. 
To alleviate this issue, we\ncan usually get rid of the dependence on dimensionality by reformulating the problem so that the\ngoal is to find the parameter in some closed centrally symmetric convex set $\\mathcal{C} \\subseteq \\mathbb{R}^p$ (such as an $\\ell_1$-norm\nball), i.e.,\n\n$$\\min_{x \\in \\mathcal{C}} F(x, D) = \\frac{1}{n} \\sum_{i=1}^{n} f(x, z_i), \\quad (8)$$\n\nwhere the loss function is convex.\n[28],[29] showed that the $\\sqrt{p}$ term in (5),(7) can be replaced by the Gaussian width of C, which is no\nlarger than $O(\\sqrt{p})$ and can be significantly smaller in practice (for more details and examples one\nmay refer to [28]). In this section, we propose a faster algorithm to achieve the upper utility bound.\nWe first give some definitions.\n\nAlgorithm 3 DP-AccMD(F, $x_0$, T, $\\sigma$, w)\nInput: f(x, z) is G-Lipschitz and L-smooth over $x \\in \\mathcal{C}$. $\\|C\\|_2$ is the $\\ell_2$-norm diameter of the\nconvex set C. w is a function that is 1-strongly convex w.r.t. $\\|\\cdot\\|_C$. $x_0$ is the initial point, and T is the\niteration number.\nDefine $B_w(y, x) = w(y) - \\langle \\nabla w(x), y - x \\rangle - w(x)$\n$y_0, z_0 = x_0$\nfor k = 0, ..., T - 1 do\n    $\\alpha_{k+1} = \\frac{k+2}{4L}$ and $r_k = \\frac{1}{2\\alpha_{k+1}L}$\n    $x_{k+1} = r_k z_k + (1 - r_k) y_k$\n    $y_{k+1} = \\arg\\min_{y \\in \\mathcal{C}} \\{\\frac{L\\|C\\|_2^2}{2}\\|y - x_{k+1}\\|_C^2 + \\langle \\nabla F(x_{k+1}), y - x_{k+1} \\rangle\\}$\n    $z_{k+1} = \\arg\\min_{z \\in \\mathcal{C}} \\{B_w(z, z_k) + \\alpha_{k+1} \\langle \\nabla F(x_{k+1}) + b_{k+1}, z - z_k \\rangle\\}$, where $b_{k+1} \\sim \\mathcal{N}(0, \\sigma^2 I_p)$\nend for\nreturn $y_T$\n\nDefinition 5.1 (Minkowski Norm). The Minkowski norm (denoted by $\\|\\cdot\\|_C$) with respect to a\ncentrally symmetric convex set $\\mathcal{C} \\subseteq \\mathbb{R}^p$ is defined as follows. 
For any vector $v \\in \\mathbb{R}^p$,\n\n$$\\|v\\|_C = \\min\\{r \\in \\mathbb{R}_+ : v \\in rC\\}.$$\n\nThe dual norm of $\\|\\cdot\\|_C$ is denoted as $\\|\\cdot\\|_{C^*}$; for any vector $v \\in \\mathbb{R}^p$, $\\|v\\|_{C^*} = \\max_{w \\in C} |\\langle w, v \\rangle|$.\n\nThe following lemma implies that every convex function f(x, z) that is L-smooth with\nrespect to the $\\ell_2$ norm is $L\\|C\\|_2^2$-smooth with respect to the $\\|\\cdot\\|_C$ norm.\nLemma 5.1. For any vector v, we have $\\|v\\|_2 \\leq \\|C\\|_2 \\|v\\|_C$, where $\\|C\\|_2$ is the $\\ell_2$-diameter and\n$\\|C\\|_2 = \\sup_{x,y \\in C} \\|x - y\\|_2$.\nDefinition 5.2 (Gaussian Width). Let $b \\sim \\mathcal{N}(0, I_p)$ be a Gaussian random vector in $\\mathbb{R}^p$. The Gaussian\nwidth for a set C is defined as $G_C = \\mathbb{E}_b[\\sup_{w \\in C} \\langle b, w \\rangle]$.\nLemma 5.2 ([28]). For $W = (\\max_{w \\in C} \\langle w, v \\rangle)^2$ where $v \\sim \\mathcal{N}(0, I_p)$, we have $\\mathbb{E}_v[W] = O(G_C^2 + \\|C\\|_2^2)$.\nOur algorithm DP-AccMD is based on the Accelerated Mirror Descent method, which was studied\nin [4],[23].\nTheorem 5.3. In DP-AccMD (Algorithm 3), for $\\epsilon, \\delta > 0$, it is $(\\epsilon, \\delta)$-differentially private if\n\n$$\\sigma^2 = c \\frac{G^2 T \\ln(1/\\delta)}{n^2\\epsilon^2} \\quad (9)$$\n\nfor some constant c.\nTheorem 5.4 (Utility Guarantee). Suppose the loss function f(x, z) is G-Lipschitz and L-smooth\nover $x \\in \\mathcal{C}$. In DP-AccMD, let $\\sigma$ be as in (9) and w be a function that is 1-strongly convex with\nrespect to $\\|\\cdot\\|_C$. 
Then if\n\n$$T^2 = O(\\frac{L\\|C\\|_2^2 \\sqrt{B_w(x^*, x_0)}\\, n\\epsilon}{G\\sqrt{\\ln(1/\\delta)}\\sqrt{G_C^2 + \\|C\\|_2^2}}),$$\n\nwe have\n\n$$\\mathbb{E}[F(y_T, D)] - F(x^*, D) \\leq O(\\frac{\\sqrt{B_w(x^*, x_0)}\\sqrt{G_C^2 + \\|C\\|_2^2}\\, G\\sqrt{\\ln(1/\\delta)}}{n\\epsilon}).$$\n\nThe total gradient complexity is $O(\\frac{n^{1.5}\\sqrt{\\epsilon L}}{(G_C^2 + \\|C\\|_2^2)^{1/4}})$.\n\n6 ERM for General Functions\n\nIn this section, we consider non-convex functions with an objective function similar to the one before,\n\n$$\\min_{x \\in \\mathbb{R}^p} F(x, D) = \\frac{1}{n} \\sum_{i=1}^{n} f(x, z_i). \\quad (10)$$\n\nAlgorithm 4 DP-GD($x_0$, F, $\\eta$, T, $\\sigma$, D)\nInput: f(x, z) is G-Lipschitz and L-smooth over $x \\in \\mathcal{C}$. F is under the assumptions. $0 < \\eta \\leq \\frac{1}{L}$\nis the step size. T is the iteration number.\nfor t = 1, 2, ..., T do\n    $x_t = x_{t-1} - \\eta(\\nabla F(x_{t-1}, D) + z_{t-1})$, where $z_{t-1} \\sim \\mathcal{N}(0, \\sigma^2 I_p)$\nend for\nreturn $x_T$ (for Section 6.1)\nreturn $x_m$, where m is sampled uniformly from $\\{0, 1, \\cdots, T - 1\\}$ (for Section 6.2)\n\nTheorem 6.1. In DP-GD (Algorithm 4), for $\\epsilon, \\delta > 0$, it is $(\\epsilon, \\delta)$-differentially private if\n\n$$\\sigma^2 = c \\frac{G^2 T \\ln(1/\\delta)}{n^2\\epsilon^2} \\quad (11)$$\n\nfor some constant c.\n\n6.1 Excess empirical risk for functions under Polyak-Lojasiewicz condition\n\nIn this section, we consider the excess empirical risk in the case where the objective function F(x, D)\nsatisfies the Polyak-Lojasiewicz condition. This topic has been studied in [18][27][26][24][22].\nDefinition 6.1 (Polyak-Lojasiewicz condition). For a function $F(\\cdot)$, denote $X^* = \\arg\\min_{x \\in \\mathbb{R}^p} F(x)$\nand $F^* = \\min_{x \\in \\mathbb{R}^p} F(x)$. 
F satisfies the Polyak-Lojasiewicz condition if there exists $\\mu > 0$ such that for every x,\n\n$$\\|\\nabla F(x)\\|^2 \\geq 2\\mu(F(x) - F^*). \\quad (12)$$\n\n(12) guarantees that every critical point (i.e., a point where the gradient vanishes) is a global\nminimum. [18] shows that if F is differentiable and L-smooth w.r.t. the $\\ell_2$ norm, then we have the\nfollowing chain of implications:\nStrong Convexity $\\Rightarrow$ Essential Strong Convexity $\\Rightarrow$ Weak Strong Convexity $\\Rightarrow$ Restricted Secant\nInequality $\\Rightarrow$ Polyak-Lojasiewicz Inequality $\\Leftrightarrow$ Error Bound\nTheorem 6.2. Suppose that f(x, z) is G-Lipschitz and L-smooth over $x \\in \\mathcal{C}$, and F(x, D) satisfies\nthe Polyak-Lojasiewicz condition. In DP-GD (Algorithm 4), let $\\sigma$ be as in (11) with $\\eta = \\frac{1}{L}$. Then if\n$T = \\tilde{O}(\\log(\\frac{n^2\\epsilon^2}{pG^2 \\log(1/\\delta)}))$, the following holds\n\n$$\\mathbb{E}[F(x_T, D)] - F(x^*, D) \\leq O(\\frac{G^2 p \\log^2(n) \\log(1/\\delta)}{n^2\\epsilon^2}), \\quad (13)$$\n\nwhere $\\tilde{O}$ hides other log, L, $\\mu$ terms.\n\nDP-GD achieves a near optimal bound since strongly convex functions can be seen as a special case in\nthe class of functions satisfying the Polyak-Lojasiewicz condition. The lower bound for strongly convex\nfunctions is $\\Omega(\\min\\{1, \\frac{p}{n^2\\epsilon^2}\\})$ [6]. Our result differs from it only by a logarithmic multiplicative term.\nThus we achieve a near optimal bound in this sense.\n\n6.2 Tight upper bound for (non)-convex case\n\nIn [34], the authors considered (non)-convex smooth loss functions and measured the utility as\n$\\|\\nabla F(x_{\\text{private}}, D)\\|^2$. They proposed an algorithm with gradient complexity $O(n^2)$. For this algorithm,\nthey showed that $\\mathbb{E}[\\|\\nabla F(x_{\\text{private}}, D)\\|^2] \\leq O(\\frac{\\log(n)\\sqrt{p \\log(1/\\delta)}}{n\\epsilon})$. By using DP-GD (Algorithm 4), we\ncan eliminate the $\\log(n)$ term.\nTheorem 6.3. Suppose that f(x, z) is G-Lipschitz and L-smooth. In DP-GD (Algorithm 4), let $\\sigma$\nbe as in (11) with $\\eta = \\frac{1}{L}$. Then when $T = O(\\frac{\\sqrt{L} n\\epsilon}{\\sqrt{p \\log(1/\\delta)} G})$, we have\n\n$$\\mathbb{E}[\\|\\nabla F(x_m, D)\\|^2] \\leq O(\\frac{\\sqrt{L} G \\sqrt{p \\log(1/\\delta)}}{n\\epsilon}). \\quad (14)$$\n\nRemark 6.1. Although we can obtain the optimal bound by Theorem 3.1 using DP-SGD, there will\nbe a constraint on $\\epsilon$. Also, we still do not know the lower bound of the utility using this measure. We\nleave it as an open problem.\n\n7 Discussions\n\nFrom the discussion in previous sections, we know that when gradient perturbation is combined\nwith linearly convergent first-order methods, a near optimal bound with less gradient complexity can be\nachieved. The remaining issue is whether the optimal bound can be obtained in this way. In Section\n6.1, we considered functions satisfying the Polyak-Lojasiewicz condition, and achieved a near optimal\nbound on the utility. It will be interesting to know the bound for functions satisfying other conditions\n(such as general gradient-dominated functions [24], quasi-convex and locally-Lipschitz functions in [16])\nunder the differential privacy model. For general non-smooth convex loss functions (such as SVM),\nwe do not know whether the optimal bound is achievable with less time complexity. Finally, for\nnon-convex loss functions, proposing a more easily interpretable measure for the utility is another direction\nfor future work.\n\nReferences\n[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep\nlearning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on\nComputer and Communications Security, pages 308\u2013318. ACM, 2016.\n\n[2] N. Agarwal and K. Singh. The price of differential privacy for online learning. In Proceedings\nof the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia,\n6-11 August 2017, pages 32\u201340, 2017.\n\n[3] Z. Allen-Zhu. 
Katyusha: the \ufb01rst direct acceleration of stochastic gradient methods.\n\nIn\nProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages\n1200\u20131205. ACM, 2017.\n\n[4] Z. Allen-Zhu and L. Orecchia. Linear Coupling: An Ultimate Uni\ufb01cation of Gradient and Mirror\nDescent. In Proceedings of the 8th Innovations in Theoretical Computer Science, ITCS \u201917,\n2017.\n\n[5] Z. Allen-Zhu and Y. Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex\nIn Proceedings of the 33rd International Conference on Machine Learning,\n\nObjectives.\nICML \u201916, 2016.\n\n[6] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Ef\ufb01cient algorithms\nand tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual\nSymposium on, pages 464\u2013473. IEEE, 2014.\n\n[7] K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In Advances in Neural\n\nInformation Processing Systems, pages 289\u2013296, 2009.\n\n[8] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk mini-\n\nmization. Journal of Machine Learning Research, 12(Mar):1069\u20131109, 2011.\n\n[9] K. Chaudhuri, A. Sarwate, and K. Sinha. Near-optimal differentially private principal compo-\n\nnents. In Advances in Neural Information Processing Systems, pages 989\u2013997, 2012.\n\n[10] C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and\n\nApplications of Models of Computation, pages 1\u201319. Springer, 2008.\n\n[11] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private\n\ndata analysis. In Theory of Cryptography Conference, pages 265\u2013284. Springer, 2006.\n\n[12] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In Foundations of\nComputer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 51\u201360. IEEE, 2010.\n\n[13] C. Dwork, K. Talwar, A. Thakurta, and L. 
Zhang. Analyze gauss: optimal bounds for privacy-\npreserving principal component analysis. In Proceedings of the 46th Annual ACM Symposium\non Theory of Computing, pages 11\u201320. ACM, 2014.\n\n[14] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. In Proceedings of the forty-\ufb01rst\n\nannual ACM symposium on Theory of computing, pages 361\u2013370. ACM, 2009.\n\n[15] M. Hardt and A. Roth. Beyond worst-case analysis in private singular vector computation. In\nProceedings of the forty-\ufb01fth annual ACM symposium on Theory of computing, pages 331\u2013340.\nACM, 2013.\n\n[16] E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex opti-\n\nmization. In Advances in Neural Information Processing Systems, pages 1594\u20131602, 2015.\n\n[17] P. Jain, P. Kothari, and A. Thakurta. Differentially private online learning. In COLT, volume 23,\n\npages 24\u20131, 2012.\n\n[18] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient\nmethods under the polyak-\u0142ojasiewicz condition. In Joint European Conference on Machine\nLearning and Knowledge Discovery in Databases, pages 795\u2013811. Springer, 2016.\n\n9\n\n\f[19] S. P. Kasiviswanathan and H. Jin. Ef\ufb01cient private empirical risk minimization for high-\nIn Proceedings of The 33rd International Conference on Machine\n\ndimensional learning.\nLearning, pages 488\u2013497, 2016.\n\n[20] S. P. Kasiviswanathan, K. Nissim, and H. Jin. Private incremental regression. In Proceedings of\nthe 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS\n2017, Chicago, IL, USA, May 14-19, 2017, pages 167\u2013182, 2017.\n\n[21] D. Kifer, A. Smith, and A. Thakurta. Private convex empirical risk minimization and high-\n\ndimensional regression. Journal of Machine Learning Research, 1(41):3\u20131, 2012.\n\n[22] G. Li and T. K. Pong. 
Calculus of the exponent of Kurdyka-\u0141ojasiewicz inequality and\nits applications to linear convergence of first-order methods. arXiv preprint arXiv:1602.02915,\n2016.\n\n[23] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming,\n103(1):127\u2013152, 2005.\n\n[24] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance.\nMathematical Programming, 108(1):177\u2013205, 2006.\n\n[25] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in\nNeural Information Processing Systems, pages 1574\u20131582, 2014.\n\n[26] B. T. Polyak. Gradient methods for the minimisation of functionals. USSR Computational\nMathematics and Mathematical Physics, 3(4):864\u2013878, 1963.\n\n[27] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for\nnonconvex optimization. In International Conference on Machine Learning, pages 314\u2013323,\n2016.\n\n[28] K. Talwar, A. Thakurta, and L. Zhang. Private empirical risk minimization beyond the worst\ncase: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.\n\n[29] K. Talwar, A. Thakurta, and L. Zhang. Nearly optimal private lasso. In Advances in Neural\nInformation Processing Systems, pages 3025\u20133033, 2015.\n\n[30] A. G. Thakurta and A. Smith. (Nearly) optimal algorithms for private online learning in full-\ninformation and bandit settings. In Advances in Neural Information Processing Systems, pages\n2733\u20132741, 2013.\n\n[31] Y.-X. Wang, J. Lei, and S. E. Fienberg. Learning with differential privacy: Stability, learnability\nand the sufficiency and necessity of ERM principle. Journal of Machine Learning Research,\n17(183):1\u201340, 2016.\n\n[32] X. Wu, M. Fredrikson, W. Wu, S. Jha, and J. F. Naughton. Revisiting differentially private regression:\nLessons from learning theory and their consequences. 
arXiv preprint arXiv:1512.06388,\n2015.\n\n[33] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance\n\nreduction. SIAM Journal on Optimization, 24(4):2057\u20132075, 2014.\n\n[34] J. Zhang, K. Zheng, W. Mou, and L. Wang. Ef\ufb01cient private ERM for smooth objectives. In\nProceedings of the Twenty-Sixth International Joint Conference on Arti\ufb01cial Intelligence, IJCAI\n2017, Melbourne, Australia, August 19-25, 2017, pages 3922\u20133928, 2017.\n\n10\n\n\f", "award": [], "sourceid": 1544, "authors": [{"given_name": "Di", "family_name": "Wang", "institution": "State University of New York at Buffalo"}, {"given_name": "Minwei", "family_name": "Ye", "institution": "University at Buffalo"}, {"given_name": "Jinhui", "family_name": "Xu", "institution": "SUNY at Buffalo"}]}