{"title": "Fast, Provably convergent IRLS Algorithm for p-norm Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 14189, "page_last": 14200, "abstract": "Linear regression in L_p-norm is a canonical optimization problem that arises in several applications, including sparse recovery, semi-supervised learning, and signal processing. Generic convex optimization algorithms for solving L_p-regression are slow in practice. Iteratively Reweighted Least Squares (IRLS) is an easy to implement family of algorithms for solving these problems that has been studied for over 50 years. However, these algorithms often diverge for p > 3, and since the work of Osborne (1985), it has been an open problem whether there is an IRLS algorithm that converges for p > 3. We propose p-IRLS, the first IRLS algorithm that provably converges geometrically for any p \\in [2,\\infty). Our algorithm is simple to implement and is guaranteed to find a high accuracy solution in a sub-linear number of iterations. Our experiments demonstrate that it performs even better than our theoretical bounds, beats the standard Matlab/CVX implementation for solving these problems by 10\u201350x, and is the fastest among available implementations in the high-accuracy regime.", "full_text": "Fast, Provably convergent IRLS Algorithm for p-norm Linear Regression*\n\nDeeksha Adil\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\ndeeksha@cs.toronto.edu\n\nRichard Peng\n\nSchool of Computer Science\n\nGeorgia Institute of Technology\n\nrpeng@cc.gatech.edu\n\nSushant Sachdeva\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nsachdeva@cs.toronto.edu\n\nAbstract\n\nLinear regression in $\\ell_p$-norm is a canonical optimization problem that arises in\nseveral applications, including sparse recovery, semi-supervised learning, and signal\nprocessing. Generic convex optimization algorithms for solving $\\ell_p$-regression\nare slow in practice. 
Iteratively Reweighted Least Squares (IRLS) is an easy to\nimplement family of algorithms for solving these problems that has been studied\nfor over 50 years. However, these algorithms often diverge for p > 3, and since the\nwork of Osborne (1985), it has been an open problem whether there is an IRLS\nalgorithm that is guaranteed to converge rapidly for p > 3. We propose p-IRLS,\nthe first IRLS algorithm that provably converges geometrically for any $p \\in [2,\\infty)$.\nOur algorithm is simple to implement and is guaranteed to find a high accuracy\nsolution in a sub-linear number of iterations. Our experiments demonstrate that it\nperforms even better than our theoretical bounds, beats the standard Matlab/CVX\nimplementation for solving these problems by 10\u201350x, and is the fastest among\navailable implementations in the high-accuracy regime.\n\n1 Introduction\n\nWe consider the problem of $\\ell_p$-norm linear regression (henceforth referred to as $\\ell_p$-regression),\n\n$$\\arg\\min_{x \\in \\mathbb{R}^n} \\|Ax - b\\|_p , \\qquad (1)$$\n\nwhere $A \\in \\mathbb{R}^{m \\times n}$, $b \\in \\mathbb{R}^m$ are given and $\\|v\\|_p = \\left(\\sum_i |v_i|^p\\right)^{1/p}$ denotes the $\\ell_p$-norm. This problem\ngeneralizes linear regression and appears in several applications including sparse recovery [CT05],\nlow rank matrix approximation [CGK+17], and graph based semi-supervised learning [AL11].\nAn important application of $\\ell_p$-regression with $p \\geq 2$ is graph based semi-supervised learning (SSL).\nRegularization using the standard graph Laplacian (also called a 2-Laplacian) was introduced in\nthe seminal paper of Zhu, Ghahramani, and Lafferty [ZGL03], and is a popular approach for graph\nbased SSL, see e.g. [ZBL+04, BMN04, CSZ09, Zhu05]. The 2-Laplacian regularization suffers from\ndegeneracy in the limit of small amounts of labeled data [NSZ09]. 
Several works have since suggested\nusing the p-Laplacian instead [AL11, BZ13, ZB11] with large p, and have established its consistency\nand effectiveness for graph based SSL with small amounts of data [ACR+16, Cal17, RCL19, ST17,\nKRSS15]. Recently, p-Laplacians have also been used for data clustering and learning problems\n[ETT15, EDT17, HFE18]. Minimizing the p-Laplacian can be easily seen as an $\\ell_p$-regression\nproblem.\n\n*Code for this work is available at https://github.com/utoronto-theory/pIRLS.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThough $\\ell_p$-regression is a convex programming problem, it is very challenging to solve in practice.\nGeneral convex programming methods such as conic programming using interior-point methods (like\nthose implemented in CVX) are very slow in practice. First order methods do not perform well for\nthese problems with p > 2 since the gradient vanishes rapidly close to the optimum.\nFor applications such as graph based SSL with p-Laplacians, it is important that we are able to compute\na solution x that approximates the optimal solution $x^\\star$ coordinate-wise rather than just achieving an\napproximately optimal objective value, since these coordinates determine the labels for the vertices.\nFor such applications, we seek a $(1 + \\varepsilon)$-approximate solution, an x such that its objective value,\n$\\|Ax - b\\|_p^p$, is at most $(1 + \\varepsilon)$ times the optimal value $\\|Ax^\\star - b\\|_p^p$, for some very small $\\varepsilon$ ($10^{-8}$\nor so) in order to achieve a reasonable coordinate-wise approximation. Hence, it is very important\nfor the dependence on $\\varepsilon$ to be $\\log 1/\\varepsilon$ rather than $\\mathrm{poly}(1/\\varepsilon)$. A $\\log 1/\\varepsilon$ guarantee implies a coordinate-wise\nconvergence guarantee with essentially no loss in the asymptotic running time. Please see\nthe supplementary material for the derivation and experimental evaluation of the coordinate-wise\nconvergence guarantees.\n\nIRLS Algorithms. 
A family of algorithms for solving the $\\ell_p$-regression problem are the IRLS\n(Iteratively Reweighted Least Squares) algorithms. IRLS algorithms have been discovered multiple times\nindependently and have been studied extensively for over 50 years, e.g. [Law61, Ric64, Osb85, GR97]\n(see [Bur12] for a detailed survey). The main step in an IRLS algorithm is to solve a weighted least\nsquares ($\\ell_2$-regression) problem to compute the next iterate,\n\n$$x^{(t+1)} = \\arg\\min_x (Ax - b)^\\top R^{(t)} (Ax - b), \\qquad (2)$$\n\nstarting from any initial solution $x^{(0)}$ (usually the least squares solution corresponding to $R = I$).\nEach iteration can be implemented by solving a linear system $x^{(t+1)} \\leftarrow (A^\\top R^{(t)} A)^{-1} A^\\top R^{(t)} b$.\nPicking $R^{(t)} = \\mathrm{diag}\\left(|Ax^{(t)} - b|^{p-2}\\right)$ gives us an IRLS algorithm where the only fixed point is\nthe minimizer of the regression problem (1) (which is unique for $p \\in (1,\\infty)$).\nThe basic version of the above IRLS algorithm converges reliably in practice for $p \\in (1.5, 3)$, and\ndiverges often even for moderate p (say $p \\geq 3.5$ [RCL19, pg 12]). Osborne [Osb85] proved that the\nabove IRLS algorithm converges in the limit for $p \\in [1, 3)$. Karlovitz [Kar70] proved a similar result\nfor an IRLS algorithm with a line search for even p > 2. However, both these results only prove\nconvergence in the limit without any quantitative bounds, and assume that you start close enough\nto the solution. The question of whether a suitable IRLS algorithm converges geometrically to the\noptimal solution for (1) in a few iterations has been open for over three decades.\n\nOur Contributions. We present p-IRLS, the first IRLS algorithm that provably converges geometrically\nto the optimal solution for $\\ell_p$-regression for all $p \\in [2,\\infty)$. Our algorithm is very similar\nto the standard IRLS algorithm for $\\ell_p$-regression, and given an $\\varepsilon > 0$, returns a feasible solution x\nfor (1) in $O_p\\left(m^{\\frac{p-2}{2(p-1)}} \\log \\frac{m}{\\varepsilon}\\right) \\leq O_p\\left(\\sqrt{m} \\log \\frac{m}{\\varepsilon}\\right)$ iterations (Theorem 3.1). 
Here m is the number of\nrows in A. We emphasize that the dependence on $\\varepsilon$ is $\\log \\frac{1}{\\varepsilon}$ rather than $\\mathrm{poly}(\\frac{1}{\\varepsilon})$.\nOur algorithm p-IRLS is very simple to implement, and our experiments demonstrate that it is much\nfaster than the available implementations for $p \\in (2,\\infty)$ in the high accuracy regime. We study its\nperformance on random dense instances of $\\ell_p$-regression, and low dimensional nearest neighbour\ngraphs for p-Laplacian SSL. Our Matlab implementation on a standard desktop machine runs in\nat most 2\u20132.5s (60\u201380 iterations) on matrices of size $1000 \\times 850$, or graphs with 1000 nodes and\naround 5000 edges, even with p = 50 and $\\varepsilon = 10^{-8}$. Our algorithm is at least 10\u201350x faster than the\nstandard Matlab/CVX solver based on Interior point methods [GB14, GB08], while finding a better\nsolution. We also converge much faster than IRLS-homotopy based algorithms [RCL19] that are not\neven guaranteed to converge to a good solution. For larger p, say p > 20, this difference is even more\ndramatic, with p-IRLS obtaining solutions with at least 4 orders of magnitude smaller error with the\nsame number of iterations. Our experiments also indicate that p-IRLS scales much better than\nindicated by our theoretical bounds, with the iteration count almost unchanged with problem size,\nand growing very slowly (at most linearly) with p.\n\n1.1 Related Works and Comparison\nIRLS algorithms have been used widely for various problems due to their exceptional simplicity and\nease of implementation, including compressive sensing [CW08], sparse signal reconstruction [GR97],\nand Chebyshev approximation in FIR filter design [BB94]. There have been various attempts at\nanalyzing variants of IRLS algorithm for $\\ell_p$-norm minimization. 
We point the reader to the survey by\nBurrus [Bur12] for numerous pointers and a thorough history.\nThe works of Osborne [Osb85] and Karlovitz [Kar70] mentioned above only prove convergence in\nthe limit without quantitative bounds and under assumptions on p and that we start close enough.\nSeveral works show that it is similar to Newton\u2019s method (e.g. [Kah72, BBS94]), or that adaptive\nstep sizes help (e.g. [VB99, VB12]) but do not prove any guarantees.\nA few notable works prove convergence guarantees for IRLS algorithms for sparse recovery (even\np < 1 in some cases) [DDFG08, DDFG10, BL18], and for low-rank matrix recovery [FRW11].\nQuantitative convergence bounds for IRLS algorithms for $\\ell_1$ are given by Straszak and Vishnoi\n[SV16b, SV16c, SV16a], inspired by slime-mold dynamics. Ene and Vladu give IRLS algorithms\nfor $\\ell_1$ and $\\ell_\\infty$ [EV19]. However, both these works have $\\mathrm{poly}(1/\\varepsilon)$ dependence in the number of\niterations, with the best result by [EV19] having a total iteration count roughly $m^{1/3}\\varepsilon^{-2/3}$.\nThe most relevant theoretical results for $\\ell_p$-norm minimization are Interior point methods [NN94],\nthe homotopy method of Bubeck et al. [BCLL18], and the iterative-refinement method of Adil et\nal. [AKPS19]. The convergence bounds we prove on the number of iterations required by p-IRLS\n(roughly $m^{\\frac{p-2}{2p-2}}$) have a better dependence on m than Interior Point methods (roughly $m^{1/2}$), but\nmarginally worse than the dependence in the work of Bubeck et al. [BCLL18] (roughly $m^{\\frac{p-2}{2p}}$) and\nAdil et al. [AKPS19] (roughly $m^{\\frac{p-2}{2p+(p-2)}}$). Note that we are comparing the dominant polynomial\nterms, and ignoring the smaller $\\mathrm{poly}(p \\log m/\\varepsilon)$ factors. A follow-up work in this line by a subset of the\nauthors [AS19] focuses on the large p case, achieving a similar running time to [AKPS19], but with\nlinear dependence on p. Also related, but not directly comparable are the works of Bullins [Bul18]\n(restricted to p = 4) and the work of Maddison et al. 
[MPT+18] (first order method with a dependence\non the condition number, which could be large).\nMore importantly, in contrast with comparable second order methods [BCLL18, AKPS19], our\nalgorithm is far simpler to implement, and has a locally greedy structure that allows for greedily\noptimizing the objective using a line search, resulting in much better performance in practice than that\nguaranteed by our theoretical bounds. Unfortunately, there are also no available implementations for\nany of the above discussed methods (other than interior point methods) in order to make a comparison.\nAnother line of heuristic algorithms combines IRLS algorithms with a homotopy based approach\n(e.g. [Kah72]; see [Bur12]). These methods start from a solution for p = 2, and slowly increase p\nmultiplicatively, using an IRLS algorithm for each phase and the previous solution as a starting point.\nThese algorithms perform better in practice than usual IRLS algorithms. However, to the best of our\nknowledge, they are not guaranteed to converge, and no bounds on their performance are known.\nRios [Rio19] provides an efficient implementation of such a method based on the work of Rios et\nal. [RCL19], along with detailed experiments. Our experiments show that our algorithm converges\nmuch faster than the implementation from Rios (see Section 4).\n\n2 Preliminaries\n\nWe first define some terms that we will use in the formal analysis of our algorithm. For our analysis\nwe use a more general form of the $\\ell_p$-regression problem,\n\n$$\\arg\\min_{x : Cx = d} \\|Ax - b\\|_p . \\qquad (3)$$\n\nSetting C and d to be empty recovers the standard $\\ell_p$-regression problem.\nDefinition 2.1 (Residual Problem). The residual problem of (3) at x is defined as,\n\n$$\\max_{\\Delta : C\\Delta = 0} \\; g^\\top A\\Delta - 2p^2 \\Delta^\\top A^\\top R A \\Delta - p^p \\|A\\Delta\\|_p^p .$$\n\nHere $R = \\mathrm{diag}\\left(|Ax - b|^{p-2}\\right)$ and $g = pR(Ax - b)$ is the gradient of the objective at x. 
Define\n$\\gamma(\\Delta)$ to denote the objective of the residual problem evaluated at $\\Delta$.\n\nDefinition 2.2 (Approximation to the Residual Problem). Let $\\kappa \\geq 1$ and $\\Delta^\\star$ be the optimum of the\nresidual problem. A $\\kappa$-approximate solution to the residual problem is $\\tilde{\\Delta}$ such that $C\\tilde{\\Delta} = 0$, and\n$\\gamma(\\tilde{\\Delta}) \\geq \\frac{1}{\\kappa} \\gamma(\\Delta^\\star)$.\n\n3 Algorithm and Analysis\n\nAlgorithm 1 p-IRLS Algorithm\n1: procedure p-IRLS($A, b, \\varepsilon, C, d$)\n2:   $x \\leftarrow \\arg\\min_{Cx = d} \\|Ax - b\\|_2^2$\n3:   $i \\leftarrow \\|Ax - b\\|_p^p / 16p$\n4:   while $\\frac{\\varepsilon}{16p(1+\\varepsilon)} \\|Ax - b\\|_p^p < i$ do\n5:     $R \\leftarrow |Ax - b|^{p-2}$\n6:     $g = pR(Ax - b)$\n7:     $s \\leftarrow \\frac{1}{2} i^{(p-2)/p} m^{-(p-2)/p}$\n8:     $\\tilde{\\Delta} \\leftarrow \\arg\\min_{g^\\top A\\Delta = i/2,\\, C\\Delta = 0} \\Delta^\\top A^\\top (R + sI) A \\Delta$\n9:     $\\alpha \\leftarrow$ LINESEARCH($A, b, x^{(t)}, \\tilde{\\Delta}$)   (comment: $\\alpha = \\arg\\min_\\alpha \\|A(x - \\alpha\\tilde{\\Delta}) - b\\|_p^p$)\n10:    $x^{(t+1)} \\leftarrow x^{(t)} - \\alpha\\tilde{\\Delta}$\n11:    if INSUFFICIENTPROGRESSCHECK($A, R + sI, \\tilde{\\Delta}, i$) then $i \\leftarrow i/2$\n12: return x\n\nAlgorithm 2 Check Progress\n1: procedure INSUFFICIENTPROGRESSCHECK($A, R, \\Delta, i$)\n2:   $\\lambda \\leftarrow 16p$\n3:   $k \\leftarrow \\frac{p^p \\|A\\Delta\\|_p^p}{2p^2 \\Delta^\\top A^\\top R A \\Delta}$\n4:   $\\alpha_0 \\leftarrow \\min\\left\\{\\frac{1}{16\\lambda}, \\frac{1}{(16\\lambda k)^{1/(p-1)}}\\right\\}$\n5:   if $\\gamma(\\alpha_0 \\cdot \\Delta) < \\frac{\\alpha_0}{4} i$ or $\\Delta^\\top A^\\top (R + sI) A \\Delta > \\lambda i/p^2$ then return true\n6:   else return false\n\nOur algorithm p-IRLS, described in Algorithm 1, is the standard IRLS algorithm (equation 2) with\na few key modifications. The first difference is that at each iteration t, we add a small systematic\npadding $s^{(t)} I$ to the weights $R^{(t)}$. The second difference is that the next iterate $x^{(t+1)}$ is calculated\nby performing a line search along the line joining the current iterate $x^{(t)}$ and the standard IRLS iterate\n$\\tilde{x}^{(t+1)}$ at iteration t+1 (with the modified weights) (see footnote 2). Both these modifications have been tried in practice,\nbut primarily from practical justifications: padding the weights avoids ill-conditioned matrices,\nand line-search can only help us converge faster and improves stability [Kar70, VB99]. 
Our\nkey contribution is to show that these modifications together allow us to provably make $\\Omega_p\\left(m^{-\\frac{p-2}{2(p-1)}}\\right)$\nprogress towards the optimum, resulting in a final iteration count of $O_p\\left(m^{\\frac{p-2}{2(p-1)}} \\log \\frac{m}{\\varepsilon}\\right)$. Finally, at\nevery iteration we check if the objective value decreases sufficiently, and this allows us to adjust\n$s^{(t)}$ appropriately. We emphasize here that our algorithm always converges. We prove the following\ntheorem:\nTheorem 3.1. Given any $A \\in \\mathbb{R}^{m \\times n}$, $b \\in \\mathbb{R}^m$, $\\varepsilon > 0$, $p \\geq 2$ and $x^\\star = \\arg\\min_{x : Cx = d} \\|Ax - b\\|_p^p$,\nAlgorithm 1 returns x such that $\\|Ax - b\\|_p^p \\leq (1 + \\varepsilon)\\|Ax^\\star - b\\|_p^p$ and $Cx = d$, in at most\n$O\\left(p^{3.5} m^{\\frac{p-2}{2(p-1)}} \\log \\frac{m}{\\varepsilon}\\right)$ iterations.\n\nThe approximation guarantee on the objective value can be translated to a guarantee on coordinate-wise\nconvergence. For details on this refer to the supplementary material.\n\nFootnote 2: Note that p-IRLS has been written in a slightly different but equivalent formulation, where it solves for\n$\\tilde{\\Delta} = x^{(t)} - \\tilde{x}^{(t+1)}$.\n\n3.1 Convergence Analysis\n\nThe analysis, at a high level, is based on iterative refinement techniques for $\\ell_p$-norms developed in\nthe work of Adil et al. [AKPS19] and Kyng et al. [KPSW19]. These techniques allow us to use a\ncrude $\\kappa$-approximate solver for the residual problem (Definition 2.1) $O_p\\left(\\kappa \\log \\frac{m}{\\varepsilon}\\right)$ number of times\nto obtain a $(1 + \\varepsilon)$-approximate solution for the $\\ell_p$-regression problem (Lemma 3.2).\nIn our algorithm, if we had solved the standard weighted $\\ell_2$ problem instead, $\\kappa$ would be unbounded.\nThe padding added to the weights allows us to prove that the solution to the weighted $\\ell_2$ problem gives a\nbounded approximation to the residual problem provided we have the correct padding, or in other\nwords the correct value of i (Lemma 3.3). We will show that the number of iterations where we are\nadjusting the value of i is small. 
Finally, Lemma 3.5 shows that when the algorithm terminates, we\nhave an $\\varepsilon$-approximate solution to our main problem. The remaining lemma of this section, Lemma\n3.4, gives the loop invariant which is used at several places in the proof of Theorem 3.1. Due to space\nconstraints, we only state the main lemmas here and defer the proofs to the supplementary material.\nWe begin with the lemma that talks about our overall iterative refinement scheme. The iterative\nrefinement scheme in [AKPS19] and [KPSW19] has an exponential dependence on p. We improve\nthis dependence to a small polynomial in p.\nLemma 3.2. (Iterative Refinement). Let $p \\geq 2$, and $\\kappa \\geq 1$. Starting from $x^{(0)} = \\arg\\min_{Cx = d} \\|Ax - b\\|_2^2$,\nand iterating as $x^{(t+1)} = x^{(t)} - \\Delta$, where $\\Delta$ is a $\\kappa$-approximate\nsolution to the residual problem (Definition 2.1), we get an $\\varepsilon$-approximate solution to (3) in at most\n$O\\left(p^2 \\kappa \\log \\frac{m}{\\varepsilon}\\right)$ calls to a $\\kappa$-approximate solver for the residual problem.\n\nThe next lemma talks about bounding the approximation factor $\\kappa$, when we have the right value of i.\nLemma 3.3. (Approximation). Let $R, g, s, \\alpha$ be as defined in lines (5), (6), (7) and (9) of Algorithm\n1. Let $\\alpha_0$ be as defined in line (4) of Algorithm 2 and $\\tilde{\\Delta}$ be the solution of the following program,\n\n$$\\arg\\min_{\\Delta} \\Delta^\\top A^\\top (R + sI) A \\Delta \\quad \\text{s.t.} \\quad g^\\top A\\Delta = i/2, \\; C\\Delta = 0 . \\qquad (4)$$\n\nIf $\\tilde{\\Delta}^\\top A^\\top (R + sI) A \\tilde{\\Delta} \\leq \\lambda i/p^2$ and $\\gamma(\\alpha_0 \\cdot \\tilde{\\Delta}) \\geq \\frac{\\alpha_0 i}{4}$, then $\\alpha \\cdot \\tilde{\\Delta}$ is an $O\\left(p^{1.5} m^{\\frac{p-2}{2(p-1)}}\\right)$-approximate\nsolution to the residual problem.\n\nWe next present the loop invariant followed by the conditions for the termination.\nLemma 3.4. (Invariant). At every iteration of the while loop, we have $Cx^{(t)} = d$,\n$\\frac{\\|Ax^{(t)} - b\\|_p^p - \\|Ax^\\star - b\\|_p^p}{16p} \\leq i$ and $i \\geq \\frac{\\varepsilon}{16p(1+\\varepsilon)} \\|Ax^{(0)} - b\\|_p^p \\, m^{-(p-2)/2}$.\nLemma 3.5. (Termination). Let i be such that $\\left(\\|Ax^{(t)} - b\\|_p^p - \\|Ax^\\star - b\\|_p^p\\right)/16p \\in (i/2, i]$. 
Then,\n\n$$i \\leq \\frac{\\varepsilon}{16p(1+\\varepsilon)} \\|Ax^{(t)} - b\\|_p^p \\;\\Rightarrow\\; \\|Ax^{(t)} - b\\|_p^p \\leq (1 + \\varepsilon)\\,\\mathrm{OPT},$$\n\nand,\n\n$$\\|Ax^{(t)} - b\\|_p^p \\leq (1 + \\varepsilon)\\,\\mathrm{OPT} \\;\\Rightarrow\\; i \\leq 2\\,\\frac{\\varepsilon}{16p(1+\\varepsilon)} \\|Ax^{(t)} - b\\|_p^p .$$\n\nWe next see how Lemmas 3.2, 3.3, 3.4, and 3.5 together imply our main result, Theorem 3.1.\n\n3.2 Proof of Theorem 3.1\n\nProof. We first show that at termination, the algorithm returns an $\\varepsilon$-approximate solution. We begin\nby noting that the quantity i can only decrease with every iteration. At iteration t, let $i_0$ denote the\nsmallest number such that $\\left(\\|Ax^{(t)} - b\\|_p^p - \\|Ax^\\star - b\\|_p^p\\right)/16p \\in (i_0/2, i_0]$. Note that i must be\nat least $i_0$ (Lemma 3.4). Let us first consider the termination condition of the while loop. When\nwe terminate, $\\frac{\\varepsilon \\|Ax^{(t)} - b\\|_p^p}{16p(1+\\varepsilon)} \\geq i \\geq i_0$. Lemma 3.5 now implies that $\\|Ax^{(t)} - b\\|_p^p \\leq (1 + \\varepsilon)\\,\\mathrm{OPT}$.\nLemma 3.4 also shows that at each iteration our solution satisfies $Cx^{(t)} = d$, therefore the solution\nreturned at termination also satisfies the subspace constraints.\n\n[Figure 2 plots omitted. Each panel shows Iterations and Time (s). Legend: p = 4, 8, 16, 32.\n(a) Horizontal axis: -log10(error). Size of A fixed to $1000 \\times 850$.\n(b) Horizontal axis: Size. Sizes of A: $(50 + 100(k - 1)) \\times 100k$. Error $\\varepsilon = 10^{-8}$.\n(c) Horizontal axis: Parameter p. Size of A is fixed to $1000 \\times 850$. Error $\\varepsilon = 10^{-8}$.]\n\nFigure 2: Random Matrix instances. Comparing the number of iterations and time taken by our algorithm with\nthe parameters. Averaged over 100 random samples for A and b. 
Linear solver used: backslash.\n\nWe next prove the running time bound. Note that the objective is non-increasing with every iteration.\nThis is because the LINESEARCH returns a factor that minimizes the objective given a direction $\\tilde{\\Delta}$,\ni.e., $\\alpha = \\arg\\min_{\\delta} \\|A(x - \\delta\\tilde{\\Delta}) - b\\|_p^p$, which could also be zero.\nWe now show that at every iteration the algorithm either reduces i or finds $\\tilde{\\Delta}$ that gives an\n$O\\left(p^{1.5} m^{\\frac{p-2}{2(p-1)}}\\right)$-approximate solution to the residual problem. Consider an iteration where the\nalgorithm does not reduce i. It suffices to prove that in this iteration, the algorithm obtains an\n$O\\left(p^{1.5} m^{\\frac{p-2}{2(p-1)}}\\right)$-approximate solution to the residual problem. Since the algorithm does not reduce\ni, we must have $\\gamma(\\alpha_0\\tilde{\\Delta}) \\geq \\alpha_0 i/4$, and $\\tilde{\\Delta}^\\top A^\\top (R + sI) A \\tilde{\\Delta} \\leq \\lambda i/p^2$. It follows from Lemma 3.3\nthat $\\tilde{\\Delta}$ gives the required approximation to the residual problem.\nThus, the algorithm either reduces i or returns an $O\\left(p^{1.5} m^{\\frac{p-2}{2(p-1)}}\\right)$-approximate solution to the\nresidual problem. The number of steps in which we reduce i is at most $\\log(i_{\\mathrm{initial}}/i_{\\min}) = p \\log \\frac{m}{\\varepsilon}$\n(Lemma 3.4 gives the value of $i_{\\min}$). By Lemma 3.2, the number of steps where the\nalgorithm finds an approximate solution before it has found a $(1 + \\varepsilon)$-approximate solution is at\nmost $O\\left(p^{3.5} m^{\\frac{p-2}{2(p-1)}} \\log \\frac{m}{\\varepsilon}\\right)$. Thus, the total number of iterations required by our algorithm is\n$O\\left(p^{3.5} m^{\\frac{p-2}{2(p-1)}} \\log \\frac{m}{\\varepsilon}\\right)$, completing the proof of the theorem.\n\n4 Experiments\n\nIn this section, we detail our results from experiments studying\nthe performance of our algorithm, p-IRLS. We implemented our\nalgorithm in Matlab on a standard desktop machine, and evaluated\nits performance on two types of instances, random instances for\n$\\ell_p$-regression, and graphs for p-Laplacian minimization. We\nstudy the scaling behavior of our algorithm as we change p, $\\varepsilon$,\nand the size of the problem. 
We compare our performance to the\nMatlab/CVX solver that is guaranteed to find a good solution,\nand to the IRLS/homotopy based implementation from [RCL19]\nthat is not guaranteed to converge, but runs quite well in practice.\nWe now describe our instances, parameters and experiments in\ndetail.\n\n[Figure 1 plot omitted. Title: Error: IRLS vs Newton with Homotopy. Curves: IRLS, Newton. Vertical axis: Error. Horizontal axis: Parameter p.]\n\nFigure 1: Averaged over 100 random samples. Graph: 1000 nodes\n(5000-6000 edges). Solver: PCG\nwith Cholesky preconditioner.\n\n[Figure 3 plots omitted. Each panel shows Iterations and Time (s). Legend: p = 4, 8, 16, 32.\n(a) Horizontal axis: -log10(error). Size of graph fixed to 1000 nodes (around 5000-6000 edges).\n(b) Horizontal axis: Size. Number of nodes: 100k. Error $\\varepsilon = 10^{-8}$.\n(c) Horizontal axis: Parameter p. Size of graph fixed to 1000 nodes (around 5000-6000 edges). Error $\\varepsilon = 10^{-8}$.]\n\nFigure 3: Graph Instances. Comparing the number of iterations and time taken by our algorithm with the\nparameters. Averaged over 100 graph samples. 
Linear solver used: backslash.\n\n[Figure 4 plots omitted. Panel titles: IRLS vs CVX: Random Matrices (panels a, b) and IRLS vs CVX: Graphs (panels c, d). Curves: IRLS, CVX. Vertical axis: Time (s).\n(a) Horizontal axis: Size. Fixed p = 8. Size of matrices: $100k \\times (50 + 100(k - 1))$.\n(b) Horizontal axis: Parameter p. Size of matrices fixed to $500 \\times 450$.\n(c) Horizontal axis: Size. Fixed p = 8. The number of nodes: 50k, k = 1, 2, ..., 10.\n(d) Horizontal axis: Parameter p. Size of graphs fixed to 400 nodes (around 2000 edges).]\n\nFigure 4: Averaged over 100 samples. Precision set to $\\varepsilon = 10^{-8}$. CVX solver used: SDPT3 for Matrices and\nSedumi for Graphs.\n\nInstances and Parameters. We consider two types of instances, random matrices and graphs.\n1. Random Matrices: We want to solve the problem $\\min_x \\|Ax - b\\|_p$. In these instances we\nuse random matrices A and b, where every entry of the matrix is chosen uniformly at random\nbetween 0 and 1.\n\n2. Graphs: We use the graphs described in [RCL19]. The set of vertices is generated by choosing\nvectors in $[0, 1]^{10}$ uniformly at random and the edges are created by connecting the 10 nearest\nneighbours. The weight of each edge is specified by a Gaussian-type function (Eq 3.1, [RCL19]).\nVery few vertices (around 10) have labels which are again chosen uniformly at random between\n0 and 1. 
The problem studied on these instances is to determine the minimizer of the $\\ell_p$ Laplacian.\nWe formulate this problem into the form $\\min_x \\|Ax - b\\|_p^p$; details of this formulation can be\nfound in the Appendix in the supplementary material.\n\nNote that we have 3 different parameters for each problem: the size of the instance, i.e., the number\nof rows of matrix A, the norm we solve for, p, and the accuracy to which we want to solve each\nproblem, $\\varepsilon$. We will consider each of these parameters independently and see how our algorithm\nscales with them for both instances.\n\nBenchmark Comparisons. We compare the performance of our program with the following:\n\n1. Standard MATLAB optimization package, CVX [GB14, GB08].\n\n2. The most efficient algorithm for $\\ell_p$-semi supervised learning given in [RCL19] was Newton\u2019s\nmethod with homotopy. We take their hardest problem, and compare the performance of their\ncode with ours by running our algorithm for the same number of iterations as them and showing\nthat we get closer to the optimum, or in other words a smaller error $\\varepsilon$, thus showing we converge\nmuch faster.\n\nImplementation Details. We normalize the instances by running our algorithm once and dividing\nthe vector b by the norm of the final objective, so that our norms at the end are around 1. We do\nthis for every instance before we measure the runtime or the iteration count for uniformity and to\navoid numerical precision issues. All experiments were performed on MATLAB 2018b on a Desktop\nUbuntu machine with an Intel Core i5-4570 CPU @ 3.20GHz $\\times$ 4 processor and 4GB RAM. For\nthe graph instances, we fix the dimension of the space from which we choose vertices to 10 and the\nnumber of labelled vertices to be 10. The graph instances are generated using the code [Rio19] by\n[RCL19]. Other details specific to the experiment are given in the captions.\n\n4.1 Experimental Results\n\nDependence on Parameters. 
Figure 2 shows the dependence of the number of iterations and\nruntime on our parameters for random matrices. Similarly for graph instances, Figure 3 shows the\ndependence of iteration count and runtime with the parameters. As expected from the theoretical\nguarantees, the number of iterations and runtimes increase linearly with $\\log \\frac{1}{\\varepsilon}$. The dependence on\nsize and p is clearly much better in practice (nearly constant and at most linear respectively) than\nthe theoretical bounds ($m^{1/2}$ and $p^{3.5}$ respectively) for both kinds of instances.\n\nComparisons with Benchmarks.\n\n\u2022 Figure 4 shows the runtime comparison between our IRLS algorithm p-IRLS and CVX. For all\ninstances, we ensured that our final objective was smaller than the objective of the CVX solver.\nAs is clear for both kinds of instances, our algorithm takes much less time, and its runtime also increases\nmore slowly with size and p as compared to CVX. Note that CVX does a lot better when\np = 2k, but it is still at least 30-50 times slower for random matrices and 10-30 times slower for\ngraphs.\n\n\u2022 Figure 1 shows the performance of our algorithm when compared to the IRLS/Homotopy method\nof [RCL19]. We use the same linear solvers for both programs, preconditioned conjugate\ngradient with an incomplete Cholesky preconditioner, and run both programs to the same number\nof iterations. The plots indicate the value $\\varepsilon$ as described previously. For our IRLS algorithm we\nindicate our upper bound on $\\varepsilon$ and for their procedure we indicate a lower bound on $\\varepsilon$ which\nis the relative difference in the objectives achieved by the two algorithms. It is clear that our\nalgorithm achieves an error that is orders of magnitude smaller than the error achieved by their\nalgorithm. This shows that our algorithm has a much faster rate of convergence. 
Note that there\nis no guarantee on the convergence of the method used by [RCL19], whereas we prove that our\nalgorithm converges in a small number of iterations.\n\n5 Discussion\n\nTo conclude, we present p-IRLS, the first IRLS algorithm that provably converges to a high accuracy\nsolution in a small number of iterations. This settles a problem that has been open for over three\ndecades. Our algorithm is very easy to implement and we demonstrate that it works very well in\npractice, beating the standard optimization packages by large margins. The theoretical bound on the\nnumber of iterations has a sub-linear dependence on size and a small polynomial dependence on p;\nhowever, in practice, we see an almost constant dependence on size and at most linear dependence on\np in random instances and graphs. In order to achieve the best theoretical bounds we would require\nsome form of acceleration. For $\\ell_1$ and $\\ell_\\infty$ regression, it has been shown that it is possible to achieve\nacceleration, however without geometric convergence. It remains an open problem to give a practical\nIRLS algorithm which simultaneously has the best possible theoretical convergence bounds.\n\nAcknowledgements\n\nDA is supported by SS\u2019s NSERC Discovery grant and an Ontario Graduate Scholarship. SS is supported\nby the Natural Sciences and Engineering Research Council of Canada (NSERC), a Connaught\nNew Researcher award, and a Google Faculty Research award. RP is partially supported by the NSF\nunder Grants No. 1637566 and No. 1718533.\n\nReferences\n[ACR+16] A. E. Alaoui, X. Cheng, A. Ramdas, M. J. Wainwright, and M. I. Jordan. Asymptotic\nbehavior of $\\ell_p$-based Laplacian regularization in semi-supervised learning. In Vitaly\nFeldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on\nLearning Theory, volume 49 of Proceedings of Machine Learning Research, pages\n879\u2013906, Columbia University, New York, New York, USA, 23\u201326 Jun 2016. PMLR.\n\n[AKPS19] D. 
Adil, R. Kyng, R. Peng, and S. Sachdeva. Iterative re\ufb01nement for `p-norm regression.\nIn Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms,\nSODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1405\u20131424, 2019.\n\n[AL11] M. Alamgir and U. V. Luxburg. Phase transition in the family of p-resistances. In\nJ. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Ad-\nvances in Neural Information Processing Systems 24, pages 379\u2013387. Curran Associates,\nInc., 2011.\n\n[AS19] D. Adil and S. Sachdeva. Faster p-norm minimizing \ufb02ows, via smoothed q-norm\nproblems. arXiv e-prints, page arXiv:1910.10571, Oct 2019. To appear at ACM-SIAM\nSymposium on Discrete Algorithms (SODA 2020).\n\n[BB94] J. A. Barreto and C. S. Burrus. lp complex approximation using iterative reweighted\nleast squares for \ufb01r digital \ufb01lters. In Proceedings of ICASSP\u201994. IEEE International\nConference on Acoustics, Speech and Signal Processing, volume 3, pages III\u2013545. IEEE,\n1994.\n\n[BBS94] C.S. Burrus, J.A. Barreto, and I.W. Selesnick. Iterative reweighted least-squares design\n\nof \ufb01r \ufb01lters. Trans. Sig. Proc., 42(11):2926\u20132936, November 1994.\n\n[BCLL18] S. Bubeck, M. B. Cohen, Y. T. Lee, and Y. Li. An homotopy method for lp regression\nprovably beyond self-concordance and in input-sparsity time. In Proceedings of the\n50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages\n1130\u20131137, New York, NY, USA, 2018. ACM.\n\n[BL18] N. Bi and K. Liang. Iteratively reweighted algorithm for signals recovery with coherent\ntight frame. Mathematical Methods in the Applied Sciences, 41(14):5481\u20135492, 2018.\n\n[BMN04] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on\nlarge graphs. In John Shawe-Taylor and Yoram Singer, editors, Learning Theory, pages\n624\u2013638, Berlin, Heidelberg, 2004. 
Springer Berlin Heidelberg.\n\n[Bul18] B. Bullins. Fast minimization of structured convex quartics. arXiv preprint\narXiv:1812.10349, 2018.\n\n[Bur12] C. S. Burrus. Iterative reweighted least squares. OpenStax CNX. Available online:\nhttp://cnx.org/contents/92b90377-2b34-49e4-b26f-7fe572db78a1, 12, 2012.\n\n[BZ13] N. Bridle and X. Zhu. p-voltages: Laplacian regularization for semi-supervised learning\non high-dimensional data. In Eleventh Workshop on Mining and Learning with Graphs\n(MLG2013), 2013.\n\n[Cal17] J. Calder. Consistency of Lipschitz learning with infinite unlabeled data and finite labeled\ndata. CoRR, abs/1710.10364, 2017.\n\n[CGK+17] F. Chierichetti, S. Gollapudi, R. Kumar, S. Lattanzi, R. Panigrahy, and D. P. Woodruff.\nAlgorithms for \u2113_p low-rank approximation. In D. Precup and Y. W. Teh, editors,\nProceedings of the 34th International Conference on Machine Learning, volume 70 of\nProceedings of Machine Learning Research, pages 806\u2013814, International Convention\nCentre, Sydney, Australia, 06\u201311 Aug 2017. PMLR.\n\n[CSZ09] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, Eds. Semi-supervised learning (Chapelle, O. et\nal., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542\u2013542,\nMarch 2009.\n\n[CT05] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on\nInformation Theory, 51(12):4203\u20134215, Dec 2005.\n\n[CW08] R. Chartrand and W. Yin. Iteratively reweighted algorithms for compressive sensing.\nIn 2008 IEEE International Conference on Acoustics, Speech and Signal Processing,\npages 3869\u20133872, March 2008.\n\n[DDFG08] I. Daubechies, R. DeVore, M. Fornasier, and S. Gunturk. Iteratively re-weighted least\nsquares minimization: Proof of faster than linear rate for sparse recovery. In 2008 42nd\nAnnual Conference on Information Sciences and Systems, pages 26\u201329, March 2008.\n\n[DDFG10] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Gunturk. 
Iteratively reweighted\nleast squares minimization for sparse recovery. Communications on Pure and Applied\nMathematics, 63(1):1\u201338, 2010.\n\n[EDT17] A. Elmoataz, X. Desquesnes, and M. Toutain. On the game p-Laplacian on weighted graphs\nwith applications in image processing and data clustering. European Journal of Applied\nMathematics, 28(6):922\u2013948, 2017.\n\n[ETT15] A. Elmoataz, M. Toutain, and D. Tenbrinck. On the p-Laplacian and \u221e-Laplacian on\ngraphs with applications in image and data processing. SIAM Journal on Imaging\nSciences, 8(4):2412\u20132451, 2015.\n\n[EV19] A. Ene and A. Vladu. Improved convergence for \u2113_1 and \u2113_\u221e regression via iteratively\nreweighted least squares. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,\nProceedings of the 36th International Conference on Machine Learning, volume 97 of\nProceedings of Machine Learning Research, pages 1794\u20131801, Long Beach, California,\nUSA, 09\u201315 Jun 2019. PMLR.\n\n[FRW11] M. Fornasier, H. Rauhut, and R. Ward. Low-rank matrix recovery via iteratively\nreweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614\u20131640,\n2011.\n\n[GB08] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In\nV. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control,\nLecture Notes in Control and Information Sciences, pages 95\u2013110. Springer-Verlag\nLimited, 2008. http://stanford.edu/~boyd/graph_dcp.html.\n\n[GB14] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming,\nversion 2.1. http://cvxr.com/cvx, March 2014.\n\n[GR97] I. F. Gorodnitsky and B. D. Rao. Sparse signal reconstruction from limited data using\nFOCUSS: a re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing,\n45(3):600\u2013616, March 1997.\n\n[HFE18] Y. Hafiene, J. Fadili, and A. Elmoataz. Nonlocal p-Laplacian variational problems on\ngraphs. arXiv preprint arXiv:1810.12817, 2018.\n\n[Kah72] S. W. Kahng. 
Best lp-approximation. Math. Comput., 26(118):505\u2013508, 1972.\n\n[Kar70] L. A. Karlovitz. Construction of nearest points in the lp, p even, and l\u221e norms. I. Journal\nof Approximation Theory, 3(2):123\u2013127, 1970.\n\n[KPSW19] R. Kyng, R. Peng, S. Sachdeva, and D. Wang. Flows in almost linear time via adaptive\npreconditioning. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory\nof Computing, STOC 2019, pages 902\u2013913, New York, NY, USA, 2019. ACM.\n\n[KRSS15] R. Kyng, A. Rao, S. Sachdeva, and D. A. Spielman. Algorithms for Lipschitz learning on\ngraphs. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris,\nFrance, July 3-6, 2015, pages 1190\u20131223, 2015.\n\n[Law61] C. L. Lawson. Contribution to the theory of linear least maximum approximation. Ph.D.\ndissertation, Univ. Calif., 1961.\n\n[MPT+18] C. J. Maddison, D. Paulin, Y. W. Teh, B. O\u2019Donoghue, and A. Doucet. Hamiltonian\ndescent methods. arXiv preprint arXiv:1809.05042, 2018.\n\n[NN94] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex\nProgramming. Society for Industrial and Applied Mathematics, 1994.\n\n[NSZ09] Boaz N., Nathan S., and Xueyuan Z. Statistical analysis of semi-supervised learning:\nThe limit of infinite unlabelled data. 2009.\n\n[Osb85] M. R. Osborne. Finite Algorithms in Optimization and Data Analysis. John Wiley &\nSons, Inc., New York, NY, USA, 1985.\n\n[RCL19] M. F. Rios, J. Calder, and G. Lerman. Algorithms for \u2113_p-based semi-supervised learning\non graphs. CoRR, abs/1901.05031, 2019.\n\n[Ric64] J. R. Rice. The Approximation of Functions, By John R. Rice. Addison-Wesley Series in\nComputer Science and Information Processing. 1964.\n\n[Rio19] M. F. Rios. Laplacian_lp_graph_ssl.\nhttps://github.com/mauriciofloresML/Laplacian_Lp_Graph_SSL, 2019.\n\n[ST17] D. Slepcev and M. Thorpe. Analysis of p-Laplacian regularization in semi-supervised\nlearning. 
CoRR, abs/1707.06213, 2017.\n\n[SV16a] D. Straszak and N. K. Vishnoi. IRLS and slime mold: Equivalence and convergence.\nCoRR, abs/1601.02712, 2016.\n\n[SV16b] D. Straszak and N. K. Vishnoi. Natural algorithms for flow problems. In Proceedings of\nthe Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016,\nArlington, VA, USA, January 10-12, 2016, pages 1868\u20131883, 2016.\n\n[SV16c] D. Straszak and N. K. Vishnoi. On a natural dynamics for linear programming. In\nProceedings of the 2016 ACM Conference on Innovations in Theoretical Computer\nScience, Cambridge, MA, USA, January 14-16, 2016, page 291, 2016.\n\n[VB99] R. A. Vargas and C. S. Burrus. Adaptive iterative reweighted least squares design of\nlp FIR filters. In 1999 IEEE International Conference on Acoustics, Speech, and Signal\nProcessing. Proceedings. ICASSP99 (Cat. No.99CH36258), volume 3, pages 1129\u20131132\nvol.3, March 1999.\n\n[VB12] R. A. Vargas and C. S. Burrus. Iterative design of lp digital filters. CoRR, abs/1207.4526,\n2012.\n\n[ZB11] X. Zhou and M. Belkin. Semi-supervised learning by higher order regularization. In\nGeoffrey Gordon, David Dunson, and Miroslav Dud\u00edk, editors, Proceedings of the\nFourteenth International Conference on Artificial Intelligence and Statistics, volume 15\nof Proceedings of Machine Learning Research, pages 892\u2013900, Fort Lauderdale, FL,\nUSA, 11\u201313 Apr 2011. PMLR.\n\n[ZBL+04] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Sch\u00f6lkopf. Learning with local and\nglobal consistency. In Advances in Neural Information Processing Systems 16, pages\n321\u2013328, Cambridge, MA, USA, June 2004. Max-Planck-Gesellschaft, MIT Press.\n\n[ZGL03] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian\nfields and harmonic functions. In Proceedings of the 20th International conference on\nMachine learning (ICML-03), pages 912\u2013919, 2003.\n\n[Zhu05] X. J. Zhu. 
Semi-supervised learning literature survey. Technical report, University of\nWisconsin-Madison Department of Computer Sciences, 2005.", "award": [], "sourceid": 7941, "authors": [{"given_name": "Deeksha", "family_name": "Adil", "institution": "University of Toronto"}, {"given_name": "Richard", "family_name": "Peng", "institution": "Georgia Tech"}, {"given_name": "Sushant", "family_name": "Sachdeva", "institution": "University of Toronto"}]}