{"title": "Smooth Primal-Dual Coordinate Descent Algorithms for Nonsmooth Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 5852, "page_last": 5861, "abstract": "We propose a new randomized coordinate descent method for a convex optimization template  with broad applications. Our analysis relies on a novel combination of four ideas applied to the primal-dual gap function: smoothing, acceleration, homotopy, and coordinate descent with non-uniform sampling. As a result, our method features the first convergence rate guarantees among the coordinate descent methods, that are the best-known under a variety of common structure assumptions on the template. We provide numerical evidence to support the theoretical results with a comparison to state-of-the-art algorithms.", "full_text": "Smooth Primal-Dual Coordinate Descent Algorithms\n\nfor Nonsmooth Convex Optimization\n\nAhmet Alacaoglu1\n\nQuoc Tran-Dinh2\n\nOlivier Fercoq3\n\nVolkan Cevher1\n\n1Laboratory for Information and Inference Systems (LIONS), EPFL, Lausanne, Switzerland\n\n{ahmet.alacaoglu, volkan.cevher}@epfl.ch\n\n2 Department of Statistics and Operations Research, UNC-Chapel Hill, NC, USA\n\nquoctd@email.unc.edu\n\n3 LTCI, T\u00e9l\u00e9com ParisTech, Universit\u00e9 Paris-Saclay, Paris, France\n\nolivier.fercoq@telecom-paristech.fr\n\nAbstract\n\nWe propose a new randomized coordinate descent method for a convex optimization\ntemplate with broad applications. Our analysis relies on a novel combination\nof four ideas applied to the primal-dual gap function: smoothing, acceleration,\nhomotopy, and coordinate descent with non-uniform sampling. As a result, our\nmethod features the \ufb01rst convergence rate guarantees among the coordinate descent\nmethods, that are the best-known under a variety of common structure assumptions\non the template. 
We provide numerical evidence to support the theoretical results\nwith a comparison to state-of-the-art algorithms.\n\n1 Introduction\nWe develop randomized coordinate descent methods to solve the following composite convex problem:\n\n$F^\\star = \\min_{x \\in \\mathbb{R}^p} \\{ F(x) = f(x) + g(x) + h(Ax) \\},$ (1)\n\nwhere $f : \\mathbb{R}^p \\to \\mathbb{R}$, $g : \\mathbb{R}^p \\to \\mathbb{R} \\cup \\{+\\infty\\}$, and $h : \\mathbb{R}^m \\to \\mathbb{R} \\cup \\{+\\infty\\}$ are proper, closed and\nconvex functions, and $A \\in \\mathbb{R}^{m \\times p}$ is a given matrix. The optimization template (1) covers many important\napplications including support vector machines, sparse model selection, logistic regression, etc. It is\nalso convenient to formulate generic constrained convex problems by choosing an appropriate $h$.\nWithin convex optimization, coordinate descent methods have recently become increasingly popular\nin the literature [1\u20136]. These methods are particularly well-suited to solve huge-scale problems\narising from machine learning applications where matrix-vector operations are prohibitive [1].\nTo our knowledge, there is no coordinate descent method for the general three-composite form (1)\nwithin our structure assumptions studied here that has rigorous convergence guarantees. Our paper\nspecifically fills this gap. For such a theoretical development, coordinate descent algorithms require\nspecific assumptions on the convex optimization problems [1, 4, 6]. As a result, to rigorously handle\nthe three-composite case, we assume that (i) $f$ is smooth, (ii) $g$ is non-smooth but decomposable\n(each component has an \u201cefficiently computable\u201d proximal operator), and (iii) $h$ is non-smooth.\nOur approach:\nIn a nutshell, we generalize [4, 7] to the three composite case (1). 
For this purpose,\nwe combine several classical and contemporary ideas: We exploit the smoothing technique in [8],\nthe efficient implementation technique in [4, 14], the homotopy strategy in [9], and the nonuniform\ncoordinate selection rule in [7] in our algorithm, to achieve the best known complexity estimate for\nthe template.\nSurprisingly, the combination of these ideas is achieved in a very natural and elementary primal-dual\ngap-based framework. However, the extension is not trivial since it requires dealing with the\ncomposition of a non-smooth function $h$ and a linear operator $A$.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWhile our work has connections to the methods developed in [7, 10, 11], it is rather distinct. First,\nwe consider a more general problem (1) than the one in [4, 7, 10]. Second, our method relies on\nNesterov\u2019s accelerated scheme rather than a primal-dual method as in [11]. Moreover, we obtain\nthe first rigorous convergence rate guarantees as opposed to [11]. In addition, we allow using any\nsampling distribution for choosing the coordinates.\nOur contributions: We propose a new smooth primal-dual randomized coordinate descent method\nfor solving (1) where $f$ is smooth, $g$ is nonsmooth, separable and has a block-wise proximal operator,\nand $h$ is a general nonsmooth function. Under such a structure, we show that our algorithm achieves\nthe best known $O(n/k)$ convergence rate, where $k$ is the iteration count. To our knowledge, this is\nthe first time that this convergence rate is proven for a coordinate descent algorithm.\nWe instantiate our algorithm to solve special cases of (1) including the case $g = 0$ and constrained\nproblems. 
We analyze the convergence rate guarantees of these variants individually and discuss the\nchoices of sampling distributions.\nExploiting the strategy in [4, 14], our algorithm can be implemented in parallel by breaking up the\nfull vector updates. We also provide a restart strategy to enhance practical performance.\nPaper organization: We review some preliminary results in Section 2. The main contribution of\nthis paper is in Section 3 with the main algorithm and its convergence guarantee. We also present\nspecial cases of the proposed algorithm. Section 4 provides numerical evidence to illustrate the\nperformance of our algorithms in comparison to existing methods. The proofs are deferred to the\nsupplementary document.\n2 Preliminaries\nNotation: Let $[n] := \\{1, 2, \\cdots, n\\}$ be the set of $n$ positive integer indices. Let us decompose\nthe variable vector $x$ into $n$ blocks denoted by $x_i$ as $x = [x_1; x_2; \\cdots; x_n]$ such that each block $x_i$\nhas the size $p_i \\geq 1$ with $\\sum_{i=1}^n p_i = p$. We also decompose the identity matrix $I_p$ of $\\mathbb{R}^p$ into $n$\nblocks as $I_p = [U_1, U_2, \\cdots, U_n]$, where $U_i \\in \\mathbb{R}^{p \\times p_i}$ has $p_i$ unit vectors. In this case, any vector\n$x \\in \\mathbb{R}^p$ can be written as $x = \\sum_{i=1}^n U_i x_i$, and each block becomes $x_i = U_i^\\top x$ for $i \\in [n]$. We\ndefine the partial gradients as $\\nabla_i f(x) = U_i^\\top \\nabla f(x)$ for $i \\in [n]$. For a convex function $f$, we use\n$\\mathrm{dom}(f)$ to denote its domain, $f^*(x) := \\sup_u \\{u^\\top x - f(u)\\}$ to denote its Fenchel conjugate, and\n$\\mathrm{prox}_f(x) := \\arg\\min_u \\{f(u) + (1/2)\\|u - x\\|^2\\}$ to denote its proximal operator. For a convex set\n$\\mathcal{X}$, $\\delta_{\\mathcal{X}}(\\cdot)$ denotes its indicator function. We also need the following weighted norms:\n\n$\\|x_i\\|_{(i)}^2 = \\langle H_i x_i, x_i \\rangle, \\quad (\\|y_i\\|_{(i)}^*)^2 = \\langle H_i^{-1} y_i, y_i \\rangle,$\n$\\|x\\|_{[\\alpha]}^2 = \\sum_{i=1}^n L_i^{\\alpha} \\|x_i\\|_{(i)}^2, \\quad (\\|y\\|_{[\\alpha]}^*)^2 = \\sum_{i=1}^n L_i^{-\\alpha} (\\|y_i\\|_{(i)}^*)^2.$ (2)\n\nHere, $H_i \\in \\mathbb{R}^{p_i \\times p_i}$ is a symmetric positive definite matrix, and $L_i \\in (0, \\infty)$ for $i \\in [n]$ and $\\alpha > 0$.\nIn addition, we use $\\|\\cdot\\|$ to denote $\\|\\cdot\\|_2$.\nFormal assumptions on the template: We require the following assumptions to tackle (1):\nAssumption 1. The functions $f$, $g$ and $h$ are all proper, closed and convex. Moreover, they satisfy\n(a) The partial derivative $\\nabla_i f(\\cdot)$ of $f$ is Lipschitz continuous with the Lipschitz constant\n$\\hat{L}_i \\in [0, +\\infty)$, i.e., $\\|\\nabla_i f(x + U_i d_i) - \\nabla_i f(x)\\|_{(i)}^* \\leq \\hat{L}_i \\|d_i\\|_{(i)}$ for all $x \\in \\mathbb{R}^p$, $d_i \\in \\mathbb{R}^{p_i}$.\n\n(b) The function $g$ is separable, which has the following form $g(x) = \\sum_{i=1}^n g_i(x_i)$.\n\n(c) One of the following assumptions for $h$ holds for Subsections 3.3 and 3.4, respectively:\n\ni. $h$ is Lipschitz continuous which is equivalent to the boundedness of $\\mathrm{dom}(h^*)$.\nii. $h$ is the indicator function for an equality constraint, i.e., $h(Ax) := \\delta_{\\{c\\}}(Ax)$.\n\nNow, we briefly describe the main techniques used in this paper.\nAcceleration: Acceleration techniques in convex optimization date back to the seminal work of\nNesterov in [13], and are among the standard techniques in convex optimization. 
We exploit such a scheme\nto achieve the best known $O(1/k)$ rate for the nonsmooth template (1).\nNonuniform distribution: We assume that $\\xi$ is a random index on $[n]$ associated with a probability\ndistribution $q = (q_1, \\cdots, q_n)^\\top$ such that\n\n$\\mathbb{P}\\{\\xi = i\\} = q_i > 0, \\ i \\in [n], \\quad \\text{and} \\quad \\sum_{i=1}^n q_i = 1.$ (3)\n\nWhen $q_i = \\frac{1}{n}$ for all $i \\in [n]$, we obtain the uniform distribution. Let $i_0, i_1, \\cdots, i_k$ be i.i.d. realizations\nof the random index $\\xi$ after $k$ iterations. We define $\\mathcal{F}_{k+1} = \\sigma(i_0, i_1, \\cdots, i_k)$ as the $\\sigma$-field generated\nby these realizations.\nSmoothing techniques: We can write the convex function $h(u) = \\sup_y \\{\\langle u, y \\rangle - h^*(y)\\}$ using\nits Fenchel conjugate $h^*$. Since $h$ in (1) is convex but possibly nonsmooth, we smooth $h$ as\n\n$h_\\beta(u) := \\max_{y \\in \\mathbb{R}^m} \\{ \\langle u, y \\rangle - h^*(y) - \\frac{\\beta}{2}\\|y - \\dot{y}\\|^2 \\},$ (4)\n\nwhere $\\dot{y} \\in \\mathbb{R}^m$ is given and $\\beta > 0$ is the smoothness parameter. Moreover, the quadratic function\n$b(y, \\dot{y}) = \\frac{1}{2}\\|y - \\dot{y}\\|^2$ is defined based on a given norm in $\\mathbb{R}^m$. Let us denote by $y^*_\\beta(u)$ the unique\nsolution of this concave maximization problem in (4), i.e.:\n\n$y^*_\\beta(u) := \\arg\\max_{y \\in \\mathbb{R}^m} \\{ \\langle u, y \\rangle - h^*(y) - \\frac{\\beta}{2}\\|y - \\dot{y}\\|^2 \\} = \\mathrm{prox}_{\\beta^{-1} h^*}(\\dot{y} + \\beta^{-1} u),$ (5)\n\nwhere $\\mathrm{prox}_{h^*}$ is the proximal operator of $h^*$. If we assume that $h$ is Lipschitz continuous, or\nequivalently that $\\mathrm{dom}(h^*)$ is bounded, then it holds that\n\n$h_\\beta(u) \\leq h(u) \\leq h_\\beta(u) + \\frac{\\beta D_{h^*}^2}{2}, \\quad \\text{where} \\quad D_{h^*} := \\max_{y \\in \\mathrm{dom}(h^*)} \\|y - \\dot{y}\\| < +\\infty.$ (6)\n\nLet us define a new smoothed function $\\psi_\\beta(x) := f(x) + h_\\beta(Ax)$. Then, $\\psi_\\beta$ is differentiable, and its\nblock partial gradient\n\n$\\nabla_i \\psi_\\beta(x) = \\nabla_i f(x) + A_i^\\top y^*_\\beta(Ax)$ (7)\n\nis also Lipschitz continuous with the Lipschitz constant $L_i(\\beta) := \\hat{L}_i + \\frac{\\|A_i\\|^2}{\\beta}$, where $\\hat{L}_i$ is given in\nAssumption 1, and $A_i \\in \\mathbb{R}^{m \\times p_i}$ is the $i$-th block of $A$.\nHomotopy:\nIn smoothing-based methods, the choice of the smoothness parameter is critical. This\nchoice may require the knowledge of the desired accuracy, number of maximum iterations or the\ndiameters of the primal and/or dual domains as in [8]. In order to make this choice flexible and our\nmethod applicable to the constrained problems, we employ a homotopy strategy developed in [9] for\ndeterministic algorithms, to gradually update the smoothness parameter while making sure that it\nconverges to 0.\n3 Smooth primal-dual randomized coordinate descent\nIn this section, we develop a smoothing primal-dual method to solve (1). Our approach is to combine\nthe four key techniques mentioned above: smoothing, acceleration, homotopy, and randomized\ncoordinate descent. Similar to [7], we allow an arbitrary nonuniform distribution, which makes it possible\nto design a distribution that captures the underlying structure of specific problems.\n3.1 The algorithm\nAlgorithm 1 below smooths, accelerates, and randomizes the coordinate descent method.\nAlgorithm 1. SMooth, Accelerate, Randomize The Coordinate Descent (SMART-CD)\nInput: Choose $\\beta_1 > 0$ and $\\alpha \\in [0, 1]$ as two input parameters. Choose $x^0 \\in \\mathbb{R}^p$.\n1 Set $B_i^0 := \\hat{L}_i + \\frac{\\|A_i\\|^2}{\\beta_1}$ for $i \\in [n]$. Compute $S_\\alpha := \\sum_{i=1}^n (B_i^0)^\\alpha$ and $q_i := \\frac{(B_i^0)^\\alpha}{S_\\alpha}$ for all $i \\in [n]$.\n2 Set $\\tau_0 := \\min\\{q_i \\mid 1 \\leq i \\leq n\\} \\in (0, 1]$. Set $\\bar{x}^0 = \\tilde{x}^0 := x^0$.\n3 for $k \\leftarrow 0, 1, \\cdots, k_{\\max}$ do\n4 Update $\\hat{x}^k := (1 - \\tau_k)\\bar{x}^k + \\tau_k \\tilde{x}^k$ and compute $\\hat{u}^k := A\\hat{x}^k$.\n5 Compute the dual step $y^*_k := y^*_{\\beta_{k+1}}(\\hat{u}^k) = \\mathrm{prox}_{\\beta_{k+1}^{-1} h^*}(\\dot{y} + \\beta_{k+1}^{-1}\\hat{u}^k)$.\n6 Select a block coordinate $i_k \\in [n]$ according to the probability distribution $q$.\n7 Set $\\tilde{x}^{k+1} := \\tilde{x}^k$, and compute the primal $i_k$-block coordinate:\n$\\tilde{x}^{k+1}_{i_k} := \\arg\\min_{x_{i_k} \\in \\mathbb{R}^{p_{i_k}}} \\{ \\langle \\nabla_{i_k} f(\\hat{x}^k) + A_{i_k}^\\top y^*_k, x_{i_k} - \\hat{x}^k_{i_k} \\rangle + g_{i_k}(x_{i_k}) + \\frac{\\tau_k B^k_{i_k}}{2\\tau_0}\\|x_{i_k} - \\tilde{x}^k_{i_k}\\|_{(i_k)}^2 \\}$.\n8 Update $\\bar{x}^{k+1} := \\hat{x}^k + \\frac{\\tau_k}{\\tau_0}(\\tilde{x}^{k+1} - \\tilde{x}^k)$.\n9 Compute $\\tau_{k+1} \\in (0, 1)$ as the unique positive root of $\\tau^3 + \\tau^2 + \\tau_k^2 \\tau - \\tau_k^2 = 0$.\n10 Update $\\beta_{k+2} := \\frac{\\beta_{k+1}}{1 + \\tau_{k+1}}$ and $B_i^{k+1} := \\hat{L}_i + \\frac{\\|A_i\\|^2}{\\beta_{k+2}}$ for $i \\in [n]$.\n11 end 
for\n\nFrom the update $\\bar{x}^k := \\hat{x}^{k-1} + \\frac{\\tau_{k-1}}{\\tau_0}(\\tilde{x}^k - \\tilde{x}^{k-1})$ and $\\hat{x}^k := (1 - \\tau_k)\\bar{x}^k + \\tau_k \\tilde{x}^k$, it directly follows\nthat $\\hat{x}^k = (1 - \\tau_k)\\big(\\hat{x}^{k-1} + \\frac{\\tau_{k-1}}{\\tau_0}(\\tilde{x}^k - \\tilde{x}^{k-1})\\big) + \\tau_k \\tilde{x}^k$. Therefore, it is possible to implement the\nalgorithm without forming $\\bar{x}^k$.\n3.2 Efficient implementation\nWhile the basic variant in Algorithm 1 requires full vector updates at each iteration, we exploit the\nidea in [4, 14] and show that we can partially update these vectors in a more efficient manner.\nAlgorithm 2. Efficient SMART-CD\nInput: Choose $\\beta_1 > 0$ and $\\alpha \\in [0, 1]$ as two input parameters. Choose $x^0 \\in \\mathbb{R}^p$.\n1 Set $B_i^0 := \\hat{L}_i + \\frac{\\|A_i\\|^2}{\\beta_1}$ for $i \\in [n]$. Compute $S_\\alpha := \\sum_{i=1}^n (B_i^0)^\\alpha$ and $q_i := \\frac{(B_i^0)^\\alpha}{S_\\alpha}$ for all $i \\in [n]$.\n2 Set $\\tau_0 := \\min\\{q_i \\mid 1 \\leq i \\leq n\\} \\in (0, 1]$ and $c_0 = (1 - \\tau_0)$. Set $u^0 = \\tilde{z}^0 := x^0$.\n3 for $k \\leftarrow 0, 1, \\cdots, k_{\\max}$ do\n4 Compute the dual step $y^*_{\\beta_{k+1}}(c_k A u^k + A\\tilde{z}^k) := \\mathrm{prox}_{\\beta_{k+1}^{-1} h^*}(\\dot{y} + \\beta_{k+1}^{-1}(c_k A u^k + A\\tilde{z}^k))$.\n5 Select a block coordinate $i_k \\in [n]$ according to the probability distribution $q$.\n6 Let $\\nabla_i^k := \\nabla_{i_k} f(c_k u^k + \\tilde{z}^k) + A_{i_k}^\\top y^*_{\\beta_{k+1}}(c_k A u^k + A\\tilde{z}^k)$. Compute\n$t^{k+1}_{i_k} := \\arg\\min_{t \\in \\mathbb{R}^{p_{i_k}}} \\{ \\langle \\nabla_i^k, t \\rangle + g_{i_k}(t + \\tilde{z}^k_{i_k}) + \\frac{\\tau_k B^k_{i_k}}{2\\tau_0}\\|t\\|_{(i_k)}^2 \\}$.\n7 Update $\\tilde{z}^{k+1}_{i_k} := \\tilde{z}^k_{i_k} + t^{k+1}_{i_k}$.\n8 Update $u^{k+1}_{i_k} := u^k_{i_k} - \\frac{1 - \\tau_k/\\tau_0}{c_k} t^{k+1}_{i_k}$.\n9 Compute $\\tau_{k+1} \\in (0, 1)$ as the unique positive root of $\\tau^3 + \\tau^2 + \\tau_k^2 \\tau - \\tau_k^2 = 0$.\n10 Update $\\beta_{k+2} := \\frac{\\beta_{k+1}}{1 + \\tau_{k+1}}$ and $B_i^{k+1} := \\hat{L}_i + \\frac{\\|A_i\\|^2}{\\beta_{k+2}}$ for $i \\in [n]$.\n11 end for\n\nWe present the following result which shows the equivalence between Algorithm 1 and Algorithm 2,\nthe proof of which can be found in the supplementary document.\n\nProposition 3.1. Let $c_k = \\prod_{l=0}^k (1 - \\tau_l)$, $\\hat{z}^k = c_k u^k + \\tilde{z}^k$ and $\\bar{z}^k = c_{k-1} u^k + \\tilde{z}^k$. Then, $\\tilde{x}^k = \\tilde{z}^k$,\n$\\hat{x}^k = \\hat{z}^k$ and $\\bar{x}^k = \\bar{z}^k$, for all $k \\geq 0$, where $\\tilde{x}^k$, $\\hat{x}^k$, and $\\bar{x}^k$ are defined in Algorithm 1.\nAccording to Algorithm 2, we never need to form or update full-dimensional vectors. The only times\nwe need $\\hat{x}^k$ are when computing the gradient and the dual variable $y^*_{\\beta_{k+1}}$. We present two special\ncases which are common in machine learning, in which we can compute these steps efficiently.\nRemark 3.2. Under the following assumptions, we can characterize the per-iteration complexity\nexplicitly. Let $A, M \\in \\mathbb{R}^{m \\times p}$, and\n\n(a) $f$ has the form $f(x) = \\sum_{j=1}^m \\phi_j(e_j^\\top M x)$, where $e_j$ is the $j$th standard unit vector.\n\n(b) $h$ is separable as in $h(Ax) = \\delta_{\\{c\\}}(Ax)$ or $h(Ax) = \\|Ax\\|_1$.\n\nAssuming that we store and maintain the residuals $r^k_{u,f} = M u^k$, $r^k_{\\tilde{z},f} = M\\tilde{z}^k$, $r^k_{u,h} = A u^k$,\n$r^k_{\\tilde{z},h} = A\\tilde{z}^k$, then we have the per-iteration cost as $O(\\max\\{|\\{j \\mid A_{ji} \\neq 0\\}|, |\\{j \\mid M_{ji} \\neq 0\\}|\\})$\narithmetic operations. If $h$ is partially separable as in [3], then the complexity of each iteration will\nremain moderate.\n\n3.3 Case 1: Convergence analysis of SMART-CD for Lipschitz continuous h\nWe provide the following main theorem, which characterizes the convergence rate of Algorithm 1.\nTheorem 3.3. Let $x^\\star$ be an optimal solution of (1) and let $\\beta_1 > 0$ be given. In addition, let\n$\\tau_0 := \\min\\{q_i \\mid i \\in [n]\\} \\in (0, 1]$ and $\\beta_0 := (1 + \\tau_0)\\beta_1$ be given parameters. For all $k \\geq 1$, the\nsequence $\\{\\bar{x}^k\\}$ generated by Algorithm 1 satisfies:\n\n$\\mathbb{E}[F(\\bar{x}^k) - F^\\star] \\leq \\frac{C^*(x^0)}{\\tau_0(k - 1) + 1} + \\frac{\\beta_1(1 + \\tau_0) D_{h^*}^2}{2(\\tau_0 k + 1)},$ (8)\n\nwhere $C^*(x^0) := (1 - \\tau_0)(F_{\\beta_0}(x^0) - F^\\star) + \\sum_{i=1}^n \\frac{\\tau_0 B_i^0}{2 q_i}\\|x_i^\\star - x_i^0\\|_{(i)}^2$ and $D_{h^*}$ is as defined by (6).\n\nIn the special case when we use uniform distribution, $\\tau_0 = q_i = 1/n$, the convergence rate reduces to\n\n$\\mathbb{E}[F(\\bar{x}^k) - F^\\star] \\leq \\frac{n C^*(x^0)}{k + n - 1} + \\frac{(n + 1)\\beta_0 D_{h^*}^2}{2k + 2n},$\n\nwhere $C^*(x^0) := (1 - \\frac{1}{n})(F_{\\beta_0}(x^0) - F^\\star) + \\sum_{i=1}^n \\frac{B_i^0}{2}\\|x_i^\\star - x_i^0\\|_{(i)}^2$. This estimate shows that the\nconvergence rate of Algorithm 1 is\n\n$O\\big(\\frac{n}{k}\\big),$\n\nwhich is the best known so far to the best of our knowledge.\n3.4 Case 2: Convergence analysis of SMART-CD for non-smooth constrained optimization\nIn this section, we instantiate Algorithm 1 to solve constrained convex optimization problems with\npossibly non-smooth terms in the objective. 
Clearly, if we choose $h(\\cdot) = \\delta_{\\{c\\}}(\\cdot)$ in (1) as the indicator\nfunction of the set $\\{c\\}$ for a given vector $c \\in \\mathbb{R}^m$, then we obtain a constrained problem:\n\n$F^\\star := \\min_{x \\in \\mathbb{R}^p} \\{ F(x) = f(x) + g(x) \\mid Ax = c \\},$ (9)\n\nwhere $f$ and $g$ are defined as in (1), $A \\in \\mathbb{R}^{m \\times p}$, and $c \\in \\mathbb{R}^m$.\nWe can specialize Algorithm 1 to solve this constrained problem by modifying the following two steps:\n\n(a) The update of $y^*_{\\beta_{k+1}}(A\\hat{x}^k)$ at Step 5 is changed to\n\n$y^*_{\\beta_{k+1}}(A\\hat{x}^k) := \\dot{y} + \\frac{1}{\\beta_{k+1}}(A\\hat{x}^k - c),$ (10)\n\nwhich requires one matrix-vector multiplication in $A\\hat{x}^k$.\n\n(b) The update of $\\tau_k$ at Step 9 and $\\beta_{k+1}$ at Step 10 are changed to\n\n$\\tau_{k+1} := \\frac{\\tau_k}{1 + \\tau_k} \\quad \\text{and} \\quad \\beta_{k+2} := (1 - \\tau_{k+1})\\beta_{k+1}.$ (11)\n\nNow, we analyze the convergence of this algorithm by providing the following theorem.\n\nTheorem 3.4. Let $\\{\\bar{x}^k\\}$ be the sequence generated by Algorithm 1 for solving (9) using the updates\n(10) and (11) and let $y^\\star$ be an arbitrary optimal solution of the dual problem of (9). In addition,\nlet $\\tau_0 := \\min\\{q_i \\mid i \\in [n]\\} \\in (0, 1]$ and $\\beta_0 := (1 + \\tau_0)\\beta_1$ be given parameters. Then, we have the\nfollowing estimates:\n\n$\\mathbb{E}[F(\\bar{x}^k) - F^\\star] \\leq \\frac{C^*(x^0)}{\\tau_0(k - 1) + 1} + \\frac{\\beta_1\\|y^\\star - \\dot{y}\\|^2}{2(\\tau_0(k - 1) + 1)} + \\|y^\\star\\| \\, \\mathbb{E}[\\|A\\bar{x}^k - c\\|],$\n$\\mathbb{E}[\\|A\\bar{x}^k - c\\|] \\leq \\frac{\\beta_1}{\\tau_0(k - 1) + 1}\\Big[\\|y^\\star - \\dot{y}\\| + \\big(\\|y^\\star - \\dot{y}\\|^2 + 2\\beta_1^{-1} C^*(x^0)\\big)^{1/2}\\Big],$ (12)\n\nwhere $C^*(x^0) := (1 - \\tau_0)(F_{\\beta_0}(x^0) - F^\\star) + \\sum_{i=1}^n \\frac{\\tau_0 B_i^0}{2 q_i}\\|x_i^\\star - x_i^0\\|_{(i)}^2$. We note that the following\nlower bound always holds: $-\\|y^\\star\\| \\, \\mathbb{E}[\\|A\\bar{x}^k - c\\|] \\leq \\mathbb{E}[F(\\bar{x}^k) - F^\\star]$.\n\n3.5 Other special cases\nWe consider the following special cases of Algorithm 1:\nThe case h = 0:\nIn this case, we obtain an algorithm similar to the one studied in [7] except that\nwe have non-uniform sampling instead of importance sampling. If the distribution is uniform, then\nwe obtain the method in [4].\nThe case g = 0:\nIn this case, we have $F(x) = f(x) + h(Ax)$, which can handle the linearly\nconstrained problems with smooth objective function. 
In this case, we can choose $\\tau_0 = 1$, and the\ncoordinate proximal gradient step, Step 7 in Algorithm 1, is simplified as\n\n$\\tilde{x}^{k+1}_{i_k} := \\tilde{x}^k_{i_k} - \\frac{q_{i_k}}{\\tau_k B^k_{i_k}} H_{i_k}^{-1}\\big(\\nabla_{i_k} f(\\hat{x}^k) + A_{i_k}^\\top y^*_{\\beta_{k+1}}(\\hat{u}^k)\\big).$ (13)\n\nIn addition, we replace Step 8 with\n\n$\\bar{x}^{k+1}_i = \\hat{x}^k_i + \\frac{\\tau_k}{q_i}(\\tilde{x}^{k+1}_i - \\tilde{x}^k_i), \\ \\forall i \\in [n].$ (14)\n\nWe then obtain the following results:\n\nCorollary 3.5. Assume that Assumption 1 holds. Let $\\tau_0 = 1$, $\\beta_1 > 0$ and Steps 7 and 8 of Algorithm 1\nbe updated by (13) and (14), respectively. If, in addition, $h$ is Lipschitz continuous, then we have\n\n$\\mathbb{E}[F(\\bar{x}^k) - F^\\star] \\leq \\frac{1}{k}\\sum_{i=1}^n \\frac{B_i^0}{2 q_i^2}\\|x_i^\\star - x_i^0\\|_{(i)}^2 + \\frac{\\beta_1 D_{h^*}^2}{k + 1},$ (15)\n\nwhere $D_{h^*}$ is defined by (6).\nIf, instead of Lipschitz continuous $h$, we have $h(\\cdot) = \\delta_{\\{c\\}}(\\cdot)$ to solve the constrained problem (9)\nwith $g = 0$, then we have\n\n$\\mathbb{E}[F(\\bar{x}^k) - F^\\star] \\leq \\frac{C^*(x^0)}{k} + \\frac{\\beta_1\\|y^\\star - \\dot{y}\\|^2}{2k} + \\|y^\\star\\| \\, \\mathbb{E}[\\|A\\bar{x}^k - c\\|],$\n$\\mathbb{E}[\\|A\\bar{x}^k - c\\|] \\leq \\frac{\\beta_1}{k}\\Big[\\|y^\\star - \\dot{y}\\| + \\big(\\|y^\\star - \\dot{y}\\|^2 + 2\\beta_1^{-1} C^*(x^0)\\big)^{1/2}\\Big],$ (16)\n\nwhere $C^*(x^0) := \\sum_{i=1}^n \\frac{B_i^0}{2 q_i^2}\\|x_i^\\star - x_i^0\\|_{(i)}^2$.\n\n3.6 Restarting SMART-CD\nIt is known that restarting an accelerated method significantly enhances its practical performance\nwhen the underlying problem admits a (restricted) strong convexity condition. 
As a result, we describe\nbelow how to restart the momentum term in Efficient SMART-CD. If the restart is injected in\nthe $k$-th iteration, then we restart the algorithm with the following steps:\n\n$u^{k+1} \\leftarrow 0,$\n$r^{k+1}_{u,f} \\leftarrow 0,$\n$r^{k+1}_{u,h} \\leftarrow 0,$\n$\\dot{y} \\leftarrow y^*_{\\beta_{k+1}}(c_k r^k_{u,h} + r^k_{\\tilde{z},h}),$\n$\\beta_{k+1} \\leftarrow \\beta_1,$\n$\\tau_{k+1} \\leftarrow \\tau_0,$\n$c_k \\leftarrow 1.$\n\nThe first three steps of the restart procedure are for restarting the primal variable, which is classical\n[15]. Restarting $\\dot{y}$ is also suggested in [9]. The cost of this procedure is essentially equal to the cost\nof one iteration as described in Remark 3.2, therefore even restarting once every epoch will not cause\na significant difference in terms of per-iteration cost.\n4 Numerical evidence\nWe illustrate the performance of Efficient SMART-CD in brain imaging and support vector machines\napplications. We also include one representative example of a degenerate linear program to illustrate\nwhy the convergence rate guarantees of our algorithm matter. We compare SMART-CD with\nVu-Condat-CD [11], which is a coordinate descent variant of Vu-Condat\u2019s algorithm [16], FISTA [17],\nASGARD [9], Chambolle-Pock\u2019s primal-dual algorithm [18], L-BFGS [19] and SDCA [5].\n4.1 A degenerate linear program: Why do convergence rate guarantees matter?\nWe consider the following degenerate linear program studied in [9]:\n\n$\\min_{x \\in \\mathbb{R}^p} \\ 2x_p \\quad \\text{s.t.} \\quad \\sum_{k=1}^{p-1} x_k = 1, \\quad x_p - \\sum_{k=1}^{p-1} x_k = 0 \\ (2 \\leq j \\leq d), \\quad x_p \\geq 0.$ (17)\n\nHere, the constraint $x_p - \\sum_{k=1}^{p-1} x_k = 0$ is repeated $d$ times. This problem satisfies the linear\nconstraint qualification condition, which guarantees the primal-dual optimality. If we define\n\n$f(x) = 2x_p, \\quad g(x) = \\delta_{\\{x_p \\geq 0\\}}(x_p), \\quad h(Ax) = \\delta_{\\{c\\}}(Ax),$\n\nwhere\n\n$Ax = \\Big[\\sum_{k=1}^{p-1} x_k, \\ x_p - \\sum_{k=1}^{p-1} x_k, \\ \\ldots, \\ x_p - \\sum_{k=1}^{p-1} x_k\\Big]^\\top, \\quad c = [1, 0, \\ldots, 0]^\\top,$\n\nwe can fit this problem and its dual form into our template (1).\n\nFigure 1: The convergence behavior of 3 algorithms on a degenerate linear program.\n\nFor this experiment, we select the dimensions $p = 10$ and $d = 200$. We implement our algorithm and\ncompare it with Vu-Condat-CD. We also combine our method with the restarting strategy proposed\nabove. We use the same mapping to fit the problem into the template of Vu-Condat-CD.\nFigure 1 illustrates the convergence behavior of Vu-Condat-CD and SMART-CD. We compare\nprimal suboptimality and feasibility in the plots. The explicit solution of the problem is used to\ngenerate the plot with primal suboptimality. We observe that degeneracy of the problem prevents\nVu-Condat-CD from making any progress towards the solution, whereas SMART-CD preserves the $O(1/k)$\nrate as predicted by theory. We emphasize that the authors in [11] proved almost sure convergence\nfor Vu-Condat-CD but they did not provide a convergence rate guarantee for this method. Since the\nproblem is certainly non-strongly convex, restarting does not significantly improve performance of\nSMART-CD.\n4.2 Total Variation and $\\ell_1$-regularized least squares regression with functional MRI data\nIn this experiment, we consider a computational neuroscience application where prediction is done\nbased on a sequence of functional MRI images. 
Since the images are high dimensional and the number\nof samples that can be taken is limited, TV-$\\ell_1$ regularization is used to get stable and predictive\nestimation results [20]. The convex optimization problem we solve is of the form:\n\n$\\min_{x \\in \\mathbb{R}^p} \\ \\frac{1}{2}\\|Mx - b\\|^2 + \\lambda r\\|x\\|_1 + \\lambda(1 - r)\\|x\\|_{\\mathrm{TV}}.$ (18)\n\nThis problem fits our template with\n\n$f(x) = \\frac{1}{2}\\|Mx - b\\|^2, \\quad g(x) = \\lambda r\\|x\\|_1, \\quad h(u) = \\lambda(1 - r)\\|u\\|_1,$\n\nwhere $D$ is the 3D finite difference operator to define a total variation norm $\\|\\cdot\\|_{\\mathrm{TV}}$ and $u = Dx$.\nWe use an fMRI dataset where the primal variable $x$ is a 3D image of the brain that contains 33177\nvoxels. Feature matrix $M$ has 768 rows, each representing the brain activity for the corresponding\nexample [20]. We compare our algorithm with Vu-Condat\u2019s algorithm, FISTA, ASGARD,\nChambolle-Pock\u2019s primal-dual algorithm, L-BFGS and Vu-Condat-CD.\n\nFigure 2: The convergence of 7 algorithms for problem (18). The regularization parameters for the\nfirst plot are $\\lambda = 0.001$, $r = 0.5$, for the second plot are $\\lambda = 0.001$, $r = 0.9$, for the third plot are\n$\\lambda = 0.01$, $r = 0.5$.\nFigure 2 illustrates the convergence behaviour of the algorithms for different values of the\nregularization parameters. Per-iteration cost of SMART-CD and Vu-Condat-CD is similar, therefore\nthe behavior of these two algorithms is quite similar in this experiment. 
Since Vu-Condat\u2019s,\nChambolle-Pock\u2019s, FISTA and ASGARD methods work with full dimensional variables, they have\nslow convergence in time. L-BFGS performs close to coordinate descent methods.\nThe simulation in Figure 2 is performed using the benchmarking tool of [20]. The algorithms are tuned\nfor the best parameters in practice.\n4.3 Linear support vector machines problem with bias\nIn this section, we consider an application of our algorithm to the support vector machines (SVM)\nproblem for binary classification. Given a training set with $m$ examples $\\{a_1, a_2, \\ldots, a_m\\}$ such that\n$a_i \\in \\mathbb{R}^p$ and class labels $\\{b_1, b_2, \\ldots, b_m\\}$ such that $b_i \\in \\{-1, +1\\}$, we define the soft margin primal\nsupport vector machines problem with bias as\n\n$\\min_{w \\in \\mathbb{R}^p} \\ \\sum_{i=1}^m C_i \\max\\big(0, 1 - b_i(\\langle a_i, w \\rangle + w_0)\\big) + \\frac{\\lambda}{2}\\|w\\|^2.$ (19)\n\nAs is common practice, we solve its dual formulation, which is a constrained problem:\n\n$\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\frac{1}{2\\lambda}\\|M D(b) x\\|^2 - \\sum_{i=1}^m x_i \\big\\} \\quad \\text{s.t.} \\quad 0 \\leq x_i \\leq C_i, \\ i = 1, \\cdots, m, \\quad b^\\top x = 0,$ (20)\n\nwhere $D(b)$ represents a diagonal matrix that has the class labels $b_i$ in its diagonal and $M \\in \\mathbb{R}^{p \\times m}$ is\nformed by the example vectors. If we define\n\n$f(x) = \\frac{1}{2\\lambda}\\|M D(b) x\\|^2 - \\sum_{i=1}^m x_i, \\quad g_i(x_i) = \\delta_{\\{0 \\leq x_i \\leq C_i\\}}, \\quad c = 0, \\quad A = b^\\top,$\n\nthen, we can fit this problem into our template in (9).\nWe apply the specific version of SMART-CD for constrained setting from Section 3.4 and compare\nwith Vu-Condat-CD and SDCA. Even though SDCA is a state-of-the-art method for SVMs, we are\nnot able to handle the bias term using SDCA. Hence, it only applies to (20) when the $b^\\top x = 0$ constraint\nis removed. This causes SDCA not to converge to the optimal solution when there is a bias term in the\nproblem (19). The following table summarizes the properties of the classification datasets we used.\n\nData Set | Training Size | Number of Features | Convergence Plot\nrcv1.binary [21, 22] | 20,242 | 47,236 | Figure 3, plot 1\na8a [21, 23] | 22,696 | 123 | Figure 3, plot 2\ngisette [21, 24] | 6,000 | 5,000 | Figure 3, plot 3\n\nFigure 3 illustrates the performance of the algorithms for solving the dual formulation of SVM in (20).\nWe compute the duality gap for each algorithm and present the results with epochs in the horizontal\naxis since per-iteration complexity of the algorithms is similar. As expected, SDCA gets stuck at\na low accuracy since it ignores one of the constraints in the problem. We demonstrate this fact in\nthe first experiment and then limit the comparison to SMART-CD and Vu-Condat-CD. Equipped\nwith restart strategy, SMART-CD shows the fastest convergence behavior due to the restricted strong\nconvexity of (20).\n\nFigure 3: The convergence of 4 algorithms on the dual SVM (20) with bias. We only used SDCA in\nthe first dataset since it stagnates at a very low accuracy.\n5 Conclusions\nCoordinate descent methods have been increasingly deployed to tackle huge scale machine learning\nproblems in recent years. The most notable works include [1\u20136]. 
Our method relates to several works in the literature including [1, 4, 7, 9, 10, 12]. The algorithms developed in [2\u20134] only considered\na special case of (1) with h = 0, and cannot be trivially extended to apply to the general setting (1).\nHere, our algorithm can be viewed as an adaptive variant of the method developed in [4] extended to\nthe sum of three functions. The idea of homotopy strategies relates to [9] for first-order primal-dual\nmethods. This paper further extends such an idea to randomized coordinate descent methods for\nsolving (1). We note that a naive application of the method developed in [4] to the smoothed problem\nwith a carefully chosen fixed smoothness parameter would result in the complexity O(n^2/k), whereas\nusing our homotopy strategy on the smoothness parameter, we reduced this complexity to O(n/k).\nWith an additional strong convexity assumption on the problem template (1), it is possible to obtain an O(1/k^2)\nrate by using deterministic first-order primal-dual algorithms [9, 18]. It remains as future work to\nincorporate strong convexity into coordinate descent methods for solving nonsmooth optimization\nproblems with a faster convergence rate.\nAcknowledgments\nThe work of O. Fercoq was supported by a public grant as part of the Investissement d\u2019avenir project,\nreference ANR-11-LABX-0056-LMH, LabEx LMH. The work of Q. Tran-Dinh was partly supported\nby NSF grant DMS-1619884, USA. The work of A. Alacaoglu and V. Cevher was supported by the\nEuropean Research Council (ERC) under the European Union\u2019s Horizon 2020 research and innovation\nprogramme (grant agreement no 725594 - time-data).\n\nReferences\n[1] Y. 
Nesterov, \u201cEfficiency of coordinate descent methods on huge-scale optimization problems,\u201d SIAM Journal on Optimization, vol. 22, no. 2, pp. 341\u2013362, 2012.\n\n[2] P. Richt\u00e1rik and M. Tak\u00e1\u010d, \u201cIteration complexity of randomized block-coordinate descent methods for minimizing a composite function,\u201d Mathematical Programming, vol. 144, no. 1-2, pp. 1\u201338, 2014.\n\n[3] P. Richt\u00e1rik and M. Tak\u00e1\u010d, \u201cParallel coordinate descent methods for big data optimization,\u201d Mathematical Programming, vol. 156, no. 1-2, pp. 433\u2013484, 2016.\n\n[4] O. Fercoq and P. Richt\u00e1rik, \u201cAccelerated, parallel, and proximal coordinate descent,\u201d SIAM Journal on Optimization, vol. 25, no. 4, pp. 1997\u20132023, 2015.\n\n[5] S. Shalev-Shwartz and T. Zhang, \u201cStochastic dual coordinate ascent methods for regularized loss minimization,\u201d Journal of Machine Learning Research, vol. 14, pp. 567\u2013599, 2013.\n\n[6] I. Necoara and D. Clipici, \u201cParallel random coordinate descent method for composite minimization: Convergence analysis and error bounds,\u201d SIAM J. on Optimization, vol. 26, no. 1, pp. 197\u2013226, 2016.\n\n[7] Z. Qu and P. Richt\u00e1rik, \u201cCoordinate descent with arbitrary sampling I: Algorithms and complexity,\u201d Optimization Methods and Software, vol. 31, no. 5, pp. 829\u2013857, 2016.\n\n[8] Y. Nesterov, \u201cSmooth minimization of non-smooth functions,\u201d Math. Prog., vol. 103, no. 1, pp. 127\u2013152, 2005.\n\n[9] Q. Tran-Dinh, O. Fercoq, and V. Cevher, \u201cA smooth primal-dual optimization framework for nonsmooth composite convex minimization,\u201d arXiv preprint arXiv:1507.06243, 2015.\n\n[10] O. Fercoq and P. Richt\u00e1rik, \u201cSmooth minimization of nonsmooth functions with parallel coordinate descent methods,\u201d arXiv preprint arXiv:1309.5885, 2013.\n\n[11] O. Fercoq and P. 
Bianchi, \u201cA coordinate descent primal-dual algorithm with large step size and possibly non separable functions,\u201d arXiv preprint arXiv:1508.04625, 2015.\n\n[12] Y. Nesterov and S.U. Stich, \u201cEfficiency of the accelerated coordinate descent method on structured optimization problems,\u201d SIAM J. on Optimization, vol. 27, no. 1, pp. 110\u2013123, 2017.\n\n[13] Y. Nesterov, \u201cA method for unconstrained convex minimization problem with the rate of convergence O(1/k^2),\u201d Doklady AN SSSR, vol. 269, translated as Soviet Math. Dokl., pp. 543\u2013547, 1983.\n\n[14] Y. T. Lee and A. Sidford, \u201cEfficient accelerated coordinate descent methods and faster algorithms for solving linear systems,\u201d in Foundations of Computer Science (FOCS), 2013 IEEE Annual Symp. on, pp. 147\u2013156, IEEE, 2013.\n\n[15] B. O\u2019Donoghue and E. Candes, \u201cAdaptive restart for accelerated gradient schemes,\u201d Foundations of Computational Mathematics, vol. 15, no. 3, pp. 715\u2013732, 2015.\n\n[16] B. C. V\u0169, \u201cA splitting algorithm for dual monotone inclusions involving cocoercive operators,\u201d Advances in Computational Mathematics, vol. 38, no. 3, pp. 667\u2013681, 2013.\n\n[17] A. Beck and M. Teboulle, \u201cA fast iterative shrinkage-thresholding algorithm for linear inverse problems,\u201d SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183\u2013202, 2009.\n\n[18] A. Chambolle and T. Pock, \u201cA first-order primal-dual algorithm for convex problems with applications to imaging,\u201d Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120\u2013145, 2011.\n\n[19] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, \u201cA limited memory algorithm for bound constrained optimization,\u201d SIAM Journal on Scientific Computing, vol. 16, no. 5, pp. 1190\u20131208, 1995.\n\n[20] E. D. Dohmatob, A. Gramfort, B. Thirion, and G. 
Varoquaux, \u201cBenchmarking solvers for TV-\u21131 least-squares and logistic regression in brain imaging,\u201d in Pattern Recognition in Neuroimaging, 2014 International Workshop on, pp. 1\u20134, IEEE, 2014.\n\n[21] C.-C. Chang and C.-J. Lin, \u201cLIBSVM: a library for support vector machines,\u201d ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.\n\n[22] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, \u201cRCV1: A new benchmark collection for text categorization research,\u201d Journal of Machine Learning Research, vol. 5, no. Apr, pp. 361\u2013397, 2004.\n\n[23] M. Lichman, \u201cUCI machine learning repository,\u201d 2013.\n\n[24] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror, \u201cResult analysis of the NIPS 2003 feature selection challenge,\u201d in Advances in Neural Information Processing Systems, pp. 545\u2013552, 2005.\n\n[25] P. Tseng, \u201cOn accelerated proximal gradient methods for convex-concave optimization,\u201d Submitted to SIAM J. Optim, 2008.\n", "award": [], "sourceid": 2998, "authors": [{"given_name": "Ahmet", "family_name": "Alacaoglu", "institution": "EPFL"}, {"given_name": "Quoc", "family_name": "Tran Dinh", "institution": "Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, North Carolina"}, {"given_name": "Olivier", "family_name": "Fercoq", "institution": "Telecom ParisTech"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}]}