{"title": "Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2674, "page_last": 2682, "abstract": "Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.", "full_text": "Taming the Wild: A Unified Analysis of\n\nHOGWILD!-Style Algorithms\n\nChristopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher R\u00e9\n\ncdesa@stanford.edu, czhang@cs.wisc.edu,\n\nkunle@stanford.edu, chrismre@stanford.edu\nDepartments of Electrical Engineering and Computer Science\n\nStanford University, Stanford, CA 94309\n\nAbstract\n\nStochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine\nlearning problems. Researchers and industry have developed several techniques\nto optimize SGD\u2019s runtime performance, including asynchronous execution\nand reduced precision. Our main result is a martingale-based analysis that\nenables us to capture the rich noise models that may arise from such techniques.\nSpecifically, we use our new analysis in three ways: (1) we derive convergence\nrates for the convex case (HOGWILD!) 
with relaxed assumptions on the sparsity\nof the problem; (2) we analyze asynchronous SGD algorithms for non-convex\nmatrix problems including matrix completion; and (3) we design and analyze\nan asynchronous SGD algorithm, called BUCKWILD!, that uses lower-precision\narithmetic. We show experimentally that our algorithms run ef\ufb01ciently for a vari-\nety of problems on modern hardware.\n\n1\n\nIntroduction\n\nMany problems in machine learning can be written as a stochastic optimization problem\n\nminimize E[ \u02dcf (x)] over x \u2208 Rn,\n\nxt+1 = xt \u2212 \u03b1\u2207 \u02dcft(xt),\n\nwhere \u02dcf is a random objective function. One popular method to solve this is with stochastic gradient\ndescent (SGD), an iterative method which, at each timestep t, chooses a random objective sample \u02dcft\nand updates\n\n(1)\nwhere \u03b1 is the step size. For most problems, this update step is easy to compute, and perhaps\nbecause of this SGD is a ubiquitous algorithm with a wide range of applications in machine learn-\ning [1], including neural network backpropagation [2, 3, 13], recommendation systems [8, 19], and\noptimization [20]. For non-convex problems, SGD is popular\u2014in particular, it is widely used in\ndeep learning\u2014but its success is poorly understood theoretically.\nGiven SGD\u2019s success in industry, practitioners have developed methods to speed up its computation.\nOne popular method to speed up SGD and related algorithms is using asynchronous execution.\nIn an asynchronous algorithm, such as HOGWILD! [17], multiple threads run an update rule such\nas Equation 1 in parallel without locks. HOGWILD! and other lock-free algorithms have been\napplied to a variety of uses, including PageRank approximations (FrogWild! [16]), deep learning\n(Dogwild! [18]) and recommender systems [24]. 
Many asynchronous versions of other stochastic\nalgorithms have been individually analyzed, such as stochastic coordinate descent (SCD) [14, 15]\nand accelerated parallel proximal coordinate descent (APPROX) [6], producing rate results that are\nsimilar to those of HOGWILD! Recently, Gupta et al. [9] gave an empirical analysis of the effects of\na low-precision variant of SGD on neural network training. Other variants of stochastic algorithms\nhave been proposed [5, 11, 12, 21, 22, 23]; only a fraction of these algorithms have been analyzed in\nthe asynchronous case. Unfortunately, a new variant of SGD (or a related algorithm) may violate the\nassumptions of existing analyses, and hence there are gaps in our understanding of these techniques.\nOne approach to filling these gaps is to analyze each purpose-built extension from scratch: an entirely\nnew model for each type of asynchrony, each type of precision, etc. In a practical sense, this may\nbe unavoidable, but ideally there would be a single technique that could analyze many models. In\nthis vein, we prove a martingale-based result that enables us to treat many different extensions as\ndifferent forms of noise within a unified model. We demonstrate our technique with three results:\n\n1. For the convex case, HOGWILD! requires strict sparsity assumptions. Using our techniques,\nwe are able to relax these assumptions and still derive convergence rates. Moreover,\nunder HOGWILD!\u2019s stricter assumptions, we recover the previous convergence rates.\n\n2. We derive convergence results for an asynchronous SGD algorithm for a non-convex matrix\ncompletion problem. We derive the first rates for asynchronous SGD following the recent\n(synchronous) non-convex SGD work of De Sa et al. [4].\n\n3. We derive convergence rates in the presence of quantization errors such as those introduced\nby fixed-point arithmetic. We validate our results experimentally, and show that\nBUCKWILD! 
can achieve speedups of up to 2.3\u00d7 over HOGWILD!-based algorithms for\nlogistic regression.\n\nOne can combine these different methods both theoretically and empirically. We begin with our\nmain result, which describes our martingale-based approach and our model.\n\n2 Main Result\n\nAnalyzing asynchronous algorithms is challenging because, unlike in the sequential case where there\nis a single copy of the iterate x, in the asynchronous case each core has a separate copy of x in its\nown cache. Writes from one core may take some time to be propagated to another core\u2019s copy of\nx, which results in race conditions where stale data is used to compute the gradient updates. This\ndifficulty is compounded in the non-convex case, where a series of unlucky random events (bad\ninitialization, inauspicious steps, and race conditions) can cause the algorithm to get stuck near a\nsaddle point or in a local minimum.\nBroadly, we analyze algorithms that repeatedly update x by running an update step\n\nx_{t+1} = x_t \u2212 \u02dcG_t(x_t), (2)\n\nfor some i.i.d. update function \u02dcGt. For example, for SGD, we would have \u02dcGt(x) = \u03b1\u2207 \u02dcft(x). The\ngoal of the algorithm must be to produce an iterate in some success region S, for example, a ball\ncentered at the optimum x\u2217. For any T , after running the algorithm for T timesteps, we say that the\nalgorithm has succeeded if xt \u2208 S for some t \u2264 T ; otherwise, we say that the algorithm has failed,\nand we denote this failure event as FT .\nOur main result is a technique that allows us to bound the convergence rates of asynchronous SGD\nand related algorithms, even for some non-convex problems. 
We use martingale methods, which\nhave produced elegant convergence rate results for both convex and some non-convex [4] algorithms.\nMartingales enable us to model multiple forms of error\u2014for example, from stochastic sampling,\nrandom initialization, and asynchronous delays\u2014within a single statistical model. Compared to\nstandard techniques, they also allow us to analyze algorithms that sometimes get stuck, which is\nuseful for non-convex problems. Our core contribution is that a martingale-based proof for the\nconvergence of a sequential stochastic algorithm can be easily modi\ufb01ed to give a convergence rate\nfor an asynchronous version.\nA supermartingale [7] is a stochastic process Wt such that E[Wt+1|Wt] \u2264 Wt. That is, the expected\nvalue is non-increasing over time. A martingale-based proof of convergence for the sequential ver-\nsion of this algorithm must construct a supermartingale Wt(xt, xt\u22121, . . . , x0) that is a function of\nboth the time and the current and past iterates; this function informally represents how unhappy we\nare with the current state of the algorithm. Typically, it will have the following properties.\nDe\ufb01nition 1. For a stochastic algorithm as described above, a non-negative process Wt : Rn\u00d7t \u2192 R\nis a rate supermartingale with horizon B if the following conditions are true. First, it must be a\n\n2\n\n\fsupermartingale; that is, for any sequence xt, . . . , x0 and any t \u2264 B,\n\nE[Wt+1(xt \u2212 \u02dcGt(xt), xt, . . . , x0)] \u2264 Wt(xt, xt\u22121, . . . , x0).\n\n(3)\nSecond, for all times T \u2264 B and for any sequence xT , . . . , x0, if the algorithm has not succeeded\nby time T (that is, xt /\u2208 S for all t < T ), it must hold that\n\nWT (xT , xT\u22121, . . . 
, x0) \u2265 T.\n\n(4)\n\nThis represents the fact that we are unhappy with running for many iterations without success.\n\nUsing this, we can easily bound the convergence rate of the sequential version of the algorithm.\nStatement 1. Assume that we run a sequential stochastic algorithm, for which W is a rate super-\nmartingale. For any T \u2264 B, the probability that the algorithm has not succeeded by time T is\n\nP (FT ) \u2264 E[W0(x0)]\n\nT\n\n.\n\nProof. In what follows, we let Wt denote the actual value taken on by the function in a process\nde\ufb01ned by (2). That is, Wt = Wt(xt, xt\u22121, . . . , x0). By applying (3) recursively, for any T ,\n\nE[WT ] \u2264 E[W0] = E[W0(x0)].\n\nBy the law of total expectation applied to the failure event FT ,\n\nE[W0(x0)] \u2265 E[WT ] = P (FT ) E[WT|FT ] + P (\u00acFT ) E[WT|\u00acFT ].\n\nApplying (4), i.e. E[WT|FT ] \u2265 T , and recalling that W is nonnegative results in\n\nE[W0(x0)] \u2265 P (FT ) T ;\n\nrearranging terms produces the result in Statement 1.\n\nThis technique is very general; in subsequent sections we show that rate supermartingales can be\nconstructed for SGD on all convex problems and for some algorithms for non-convex problems.\n\n2.1 Modeling Asynchronicity\n\nThe behavior of an asynchronous SGD algorithm depends both on the problem it is trying to solve\nand on the hardware it is running on. For ease of analysis, we assume that the hardware has the\nfollowing characteristics. These are basically the same assumptions used to prove the original HOG-\nWILD! result [17].\n\n\u2022 There are multiple threads running iterations of (2), each with their own cache. At any point\nin time, these caches may hold different values for the variable x, and they communicate\nvia some cache coherency protocol.\n\n\u2022 There exists a central store S (typically RAM) at which all writes are serialized. 
This\nprovides a consistent value for the state of the system at any point in real time.\n\n\u2022 If a thread performs a read R of a previously written value X, and then writes another\nvalue Y (dependent on R), then the write that produced X will be committed to S before\nthe write that produced Y.\n\n\u2022 Each write from an iteration of (2) is to only a single entry of x and is done using an atomic\nread-add-write instruction. That is, there are no write-after-write races (handling these is\npossible, but complicates the analysis).\n\nNotice that, if we let xt denote the value of the vector x in the central store S after t writes have\noccurred, then since the writes are atomic, the value of xt+1 is solely dependent on the single thread\nthat produces the write that is serialized next in S. If we let \u02dcGt denote the update function sample\nthat is used by that thread for that write, and \u02dcvt denote the cached value of x used by that write, then\n\nx_{t+1} = x_t \u2212 \u02dcG_t(\u02dcv_t). (5)\n\nOur hardware model further constrains the value of \u02dcvt: all the read elements of \u02dcvt must have been\nwritten to S at some time before t. Therefore, for some nonnegative variable \u02dc\u03c4i,t,\n\ne_i^T \u02dcv_t = e_i^T x_{t\u2212\u02dc\u03c4_{i,t}}, (6)\n\nwhere ei is the ith standard basis vector. We can think of \u02dc\u03c4i,t as the delay in the ith coordinate\ncaused by the parallel updates.\nWe can conceive of this system as a stochastic process with two sources of randomness: the noisy\nupdate function samples \u02dcGt and the delays \u02dc\u03c4i,t. We assume that the \u02dcGt are independent and identically\ndistributed; this is reasonable because they are sampled independently by the updating threads. It\nwould be unreasonable, though, to assume the same for the \u02dc\u03c4i,t, since delays may very well be\ncorrelated in the system. Instead, we assume that the delays are bounded from above by some random\nvariable \u02dc\u03c4. 
Specifically, if Ft, the filtration, denotes all random events that occurred before timestep\nt, then for any i, t, and k,\n\nP(\u02dc\u03c4_{i,t} \u2265 k | F_t) \u2264 P(\u02dc\u03c4 \u2265 k). (7)\n\nWe let \u03c4 = E[\u02dc\u03c4 ], and call \u03c4 the worst-case expected delay.\n\n2.2 Convergence Rates for Asynchronous SGD\n\nNow that we are equipped with a stochastic model for the asynchronous SGD algorithm, we show\nhow we can use a rate supermartingale to give a convergence rate for asynchronous algorithms. To\ndo this, we need some continuity and boundedness assumptions; we collect these into a definition,\nand then state the theorem.\nDefinition 2. An algorithm with rate supermartingale W is (H, R, \u03be)-bounded if the following\nconditions hold. First, W must be Lipschitz continuous in the current iterate with parameter H; that\nis, for any t, u, v, and sequence xt, . . . , x0,\n\n\u2016W_t(u, x_{t\u22121}, . . . , x_0) \u2212 W_t(v, x_{t\u22121}, . . . , x_0)\u2016 \u2264 H \u2016u \u2212 v\u2016. (8)\n\nSecond, \u02dcG must be Lipschitz continuous in expectation with parameter R; that is, for any u and v,\n\nE[\u2016\u02dcG(u) \u2212 \u02dcG(v)\u2016] \u2264 R \u2016u \u2212 v\u2016_1. (9)\n\nThird, the expected magnitude of the update must be bounded by \u03be. That is, for any x,\n\nE[\u2016\u02dcG(x)\u2016] \u2264 \u03be. (10)\n\nTheorem 1. Assume that we run an asynchronous stochastic algorithm with the above hardware\nmodel, for which W is a (H, R, \u03be)-bounded rate supermartingale with horizon B. Further assume\nthat HR\u03be\u03c4 < 1. For any T \u2264 B, the probability that the algorithm has not succeeded by time T is\n\nP(F_T) \u2264 E[W_0(x_0)] / ((1 \u2212 HR\u03be\u03c4) T).\n\nNote that this rate depends only on the worst-case expected delay \u03c4 and not on any other properties\nof the hardware model. 
Compared to the result of Statement 1, the bound on the probability of failure has only\nincreased by a factor of (1 \u2212 HR\u03be\u03c4)^{\u22121}. In most practical cases, HR\u03be\u03c4 \u226a 1, so this increase in\nprobability is negligible.\nSince the proof of this theorem is simple, but uses non-standard techniques, we outline it here.\nFirst, notice that the process Wt, which was a supermartingale in the sequential case, is not one in the\nasynchronous case because of the delayed updates. Our strategy is to use W to produce a new\nprocess Vt that is a supermartingale in this case. For any t and x\u00b7, if xu \u2209 S for all u < t, we define\n\nV_t(x_t, . . . , x_0) = W_t(x_t, . . . , x_0) \u2212 HR\u03be\u03c4 t + HR \u2211_{k=1}^{\u221e} \u2016x_{t\u2212k+1} \u2212 x_{t\u2212k}\u2016 \u2211_{m=k}^{\u221e} P(\u02dc\u03c4 \u2265 m).\n\nCompared with W , there are two additional terms here. The first term is negative, and cancels out\nsome of the unhappiness from (4) that we ascribed to running for many iterations. We can interpret\nthis as us accepting that we may need to run for more iterations than in the sequential case. The\nsecond term measures the distance between recent iterates; we would be unhappy if this becomes\nlarge because then the noise from the delayed updates would also be large. On the other hand, if\nxu \u2208 S for some u < t, then we define\n\nV_t(x_t, . . . , x_u, . . . , x_0) = V_u(x_u, . . . , x_0).\n\nWe call Vt a stopped process because its value doesn\u2019t change after success occurs. It is straightforward\nto show that Vt is a supermartingale for the asynchronous algorithm. Once we know this, the\nsame logic used in the proof of Statement 1 can be used to prove Theorem 1.\nTheorem 1 gives us a straightforward way of bounding the convergence time of any asynchronous\nstochastic algorithm. First, we find a rate supermartingale for the problem; this is typically no\nharder than proving sequential convergence. 
Second, we find parameters such that the problem is\n(H, R, \u03be)-bounded; this is easily done for well-behaved problems by using differentiation\nto bound the Lipschitz constants. Third, we apply Theorem 1 to get a rate for asynchronous SGD.\nUsing this method, analyzing an asynchronous algorithm is really no more difficult than analyzing\nits sequential analog.\n\n3 Applications\n\nNow that we have proved our main result, we turn our attention to applications. We show, for\na couple of algorithms, how to construct a rate supermartingale. We demonstrate that doing this\nallows us to recover known rates for HOGWILD! algorithms as well as analyze cases where no\nknown rates exist.\n\n3.1 Convex Case, High Precision Arithmetic\n\nFirst, we consider the simple case of using asynchronous SGD to minimize a convex function f (x)\nusing unbiased gradient samples \u2207 \u02dcf (x). That is, we run the update rule\n\nx_{t+1} = x_t \u2212 \u03b1\u2207\u02dcf_t(x_t). (11)\n\nWe make the standard assumption that f is strongly convex with parameter c; that is, for all x and y,\n\n(x \u2212 y)^T (\u2207f(x) \u2212 \u2207f(y)) \u2265 c \u2016x \u2212 y\u2016^2. (12)\n\nWe also assume continuous differentiability of \u2207 \u02dcf with 1-norm Lipschitz constant L,\n\nE[\u2016\u2207\u02dcf(x) \u2212 \u2207\u02dcf(y)\u2016] \u2264 L \u2016x \u2212 y\u2016_1. (13)\n\nWe require that the second moment of the gradient sample is also bounded for some M > 0 by\n\nE[\u2016\u2207\u02dcf(x)\u2016^2] \u2264 M^2. (14)\n\nFor some \u03f5 > 0, we let the success region be\n\nS = {x | \u2016x \u2212 x\u2217\u2016^2 \u2264 \u03f5}.\n\nUnder these conditions, we can construct a rate supermartingale for this algorithm.\nLemma 1. There exists a Wt where, if the algorithm hasn\u2019t succeeded by timestep t,\n\nW_t(x_t, . . . , x_0) = (\u03f5 / (2\u03b1c\u03f5 \u2212 \u03b1^2 M^2)) log(e \u2016x_t \u2212 x\u2217\u2016^2 \u03f5^{\u22121}) + t,\n\nsuch that Wt is a rate supermartingale for the above algorithm with horizon B = \u221e. Furthermore, it\nis (H, R, \u03be)-bounded with parameters: H = 2\u221a\u03f5 (2\u03b1c\u03f5 \u2212 \u03b1^2 M^2)^{\u22121}, R = \u03b1L, and \u03be = \u03b1M.\n\nUsing this and Theorem 1 gives us a direct bound on the failure rate of convex HOGWILD! SGD.\nCorollary 1. Assume that we run an asynchronous version of the above SGD algorithm, where for\nsome constant \u03d1 \u2208 (0, 1) we choose step size\n\n\u03b1 = c\u03f5\u03d1 / (M^2 + 2LM\u03c4\u221a\u03f5).\n\nThen for any T , the probability that the algorithm has not succeeded by time T is\n\nP(F_T) \u2264 ((M^2 + 2LM\u03c4\u221a\u03f5) / (c^2 \u03f5\u03d1T)) log(e \u2016x_0 \u2212 x\u2217\u2016^2 \u03f5^{\u22121}).\n\nThis result is more general than the result in Niu et al. [17]. The main differences are that we make\nno assumptions about the sparsity structure of the gradient samples, and that our rate depends only\non the second moment of \u02dcG and the expected value of \u02dc\u03c4, as opposed to requiring absolute bounds\non their magnitude. Under their stricter assumptions, the result of Corollary 1 recovers their rate.\n\n3.2 Convex Case, Low Precision Arithmetic\n\nOne of the ways BUCKWILD! achieves high performance is by using low-precision fixed-point\narithmetic. This introduces additional noise to the system in the form of round-off error. We consider\nthis error to be part of the BUCKWILD! hardware model. We assume that the round-off error can\nbe modeled by an unbiased rounding function operating on the update samples. That is, for some\nchosen precision factor \u03ba, there is a random quantization function \u02dcQ such that, for any x \u2208 R, it\nholds that E[ \u02dcQ(x)] = x, and the round-off error is bounded by | \u02dcQ(x) \u2212 x| < \u03b1\u03baM. 
Using this\nfunction, we can write a low-precision asynchronous update rule for convex SGD as\n\nx_{t+1} = x_t \u2212 \u02dcQ_t(\u03b1\u2207\u02dcf_t(\u02dcv_t)), (15)\n\nwhere \u02dcQt operates only on the single nonzero entry of \u2207 \u02dcft(\u02dcvt). In the same way as we did in the\nhigh-precision case, we can use these properties to construct a rate supermartingale for the\nlow-precision version of the convex SGD algorithm, and then use Theorem 1 to bound the failure rate of\nconvex BUCKWILD!\nCorollary 2. Assume that we run asynchronous low-precision convex SGD, and for some \u03d1 \u2208 (0, 1),\nwe choose step size\n\n\u03b1 = c\u03f5\u03d1 / (M^2 (1 + \u03ba^2) + LM\u03c4 (2 + \u03ba^2)\u221a\u03f5),\n\nthen for any T , the probability that the algorithm has not succeeded by time T is\n\nP(F_T) \u2264 ((M^2 (1 + \u03ba^2) + LM\u03c4 (2 + \u03ba^2)\u221a\u03f5) / (c^2 \u03f5\u03d1T)) log(e \u2016x_0 \u2212 x\u2217\u2016^2 \u03f5^{\u22121}).\n\nTypically, we choose a precision such that \u03ba \u226a 1; in this case, the increased error compared to the\nresult of Corollary 1 will be negligible and we will converge in a number of samples that is very\nsimilar to the high-precision, sequential case. Since each BUCKWILD! update runs in less time than\nan equivalent HOGWILD! update, this result means that an execution of BUCKWILD! will produce\nsame-quality output in less wall-clock time compared with HOGWILD!\n\n3.3 Non-Convex Case, High Precision Arithmetic\n\nMany machine learning problems are non-convex, but are still solved in practice with SGD. In this\nsection, we show that our technique can be adapted to analyze non-convex problems. Unfortunately,\nthere are no general convergence results that provide rates for SGD on non-convex problems, so it\nwould be unreasonable to expect a general proof of convergence for non-convex HOGWILD! 
Instead,\nwe focus on a particular problem, low-rank least-squares matrix completion,\n\nminimize E[\u2016\u02dcA \u2212 xx^T\u2016_F^2] subject to x \u2208 R^n, (16)\n\nfor which there exists a sequential SGD algorithm with a martingale-based rate that has already\nbeen proven. This problem arises in general data analysis, subspace tracking, principal component\nanalysis, recommendation systems, and other applications [4]. In what follows, we let A = E[ \u02dcA].\nWe assume that A is symmetric, and has unit eigenvectors u1, u2, . . . , un with corresponding\neigenvalues \u03bb1 > \u03bb2 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03bbn. We let \u2206, the eigengap, denote \u2206 = \u03bb1 \u2212 \u03bb2.\nDe Sa et al. [4] provide a martingale-based rate of convergence for a particular SGD algorithm,\nAlecton, running on this problem. For simplicity, we focus on only the rank-1 version of the\nproblem, and we assume that, at each timestep, a single entry of A is used as a sample. Under these\nconditions, Alecton uses the update rule\n\nx_{t+1} = (I + \u03b7n^2 e_{\u02dci_t} e_{\u02dci_t}^T A e_{\u02dcj_t} e_{\u02dcj_t}^T) x_t, (17)\n\nwhere \u02dcit and \u02dcjt are randomly-chosen indices in [1, n]. It initializes x0 uniformly on the sphere of\nsome radius centered at the origin. We can equivalently think of this as a stochastic power iteration\nalgorithm. For any \u03f5 > 0, we define the success set S to be\n\nS = {x | (u_1^T x)^2 \u2265 (1 \u2212 \u03f5) \u2016x\u2016^2}. (18)\n\nThat is, we are only concerned with the direction of x, not its magnitude; this algorithm only recovers\nthe dominant eigenvector of A, not its eigenvalue. In order to show convergence for this entrywise\nsampling scheme, De Sa et al. [4] require that the matrix A satisfy a coherence bound [10].\n\nTable 1: Training loss of SGD as a function of arithmetic precision for logistic regression.\n\nDataset | Rows | Columns | Size | 32-bit float | 16-bit int | 8-bit int\nReuters | 8K | 18K | 1.2GB | 0.5700 | 0.5700 | 0.5709\nForest | 581K | 54 | 0.2GB | 0.6463 | 0.6463 | 0.6447\nRCV1 | 781K | 47K | 0.9GB | 0.1888 | 0.1888 | 0.1879\nMusic | 515K | 91 | 0.7GB | 0.8785 | 0.8785 | 0.8781\n\nDefinition 3. A matrix A \u2208 R^{n\u00d7n} is incoherent with parameter \u00b5 if for every standard basis vector\nej, and for all unit eigenvectors ui of the matrix, (e_j^T u_i)^2 \u2264 \u00b5^2 n^{\u22121}.\nThey also require that the step size be set, for some constants 0 < \u03b3 \u2264 1 and 0 < \u03d1 < (1 + \u03f5)^{\u22121}, as\n\n\u03b7 = \u2206\u03f5\u03b3\u03d1 / (2n\u00b5^4 \u2016A\u2016_F^2).\n\nFor ease of analysis, we add the additional assumption that our algorithm runs in some bounded\nspace. That is, for some constant C, at all times t, 1 \u2264 \u2016x_t\u2016 and \u2016x_t\u2016_1 \u2264 C. As in the convex\ncase, by following the martingale-based approach of De Sa et al. [4], we are able to generate a rate\nsupermartingale for this algorithm; to save space, we only state its initial value and not the full\nexpression.\nLemma 2. For the problem above, choose any horizon B such that \u03b7\u03b3\u03f5\u2206B \u2264 1. 
Then there exists\na function Wt such that Wt is a rate supermartingale for the above non-convex SGD algorithm with\nparameters H = 8n\u03b7^{\u22121}\u03b3^{\u22121}\u2206^{\u22121}\u03f5^{\u22121/2}, R = \u03b7\u00b5\u2016A\u2016_F, and \u03be = \u03b7\u00b5\u2016A\u2016_F C, and\n\nE[W_0(x_0)] \u2264 2\u03b7^{\u22121}\u2206^{\u22121} log(en\u03b3^{\u22121}\u03f5^{\u22121}) + B\u221a(2\u03c0\u03b3).\n\nNote that the analysis parameter \u03b3 allows us to trade off between B, which determines how long we\ncan run the algorithm, and the initial value of the supermartingale E [W0(x0)]. We can now produce\na corollary about the convergence rate by applying Theorem 1 and setting B and T appropriately.\nCorollary 3. Assume that we run HOGWILD! Alecton under these conditions for T timesteps, as\ndefined below. Then the probability of failure, P (FT ), will be bounded as below.\n\nT = (4n\u00b5^4 \u2016A\u2016_F^2 / (\u2206^2 \u03f5\u03b3\u03d1\u221a(2\u03c0\u03b3))) log(en / (\u03b3\u03f5)),    P(F_T) \u2264 \u221a(8\u03c0\u03b3) \u00b5^2 / (\u00b5^2 \u2212 4C\u03d1\u03c4\u221a\u03f5).\n\nThe fact that we are able to use our technique to analyze a non-convex algorithm illustrates its\ngenerality. Note that it is possible to combine our results to analyze asynchronous low-precision\nnon-convex SGD, but the resulting formulas are complex, so we do not include them here.\n\n4 Experiments\n\nWe validate our theoretical results for both asynchronous non-convex matrix completion and\nBUCKWILD!, a HOGWILD! implementation with lower-precision arithmetic. Like HOGWILD!, a\nBUCKWILD! algorithm has multiple threads running an update rule (2) in parallel without locking.\nCompared with HOGWILD!, which uses 32-bit floating point numbers to represent input data,\nBUCKWILD! uses limited-precision arithmetic by rounding the input data to 8-bit or 16-bit integers. 
This\nnot only decreases the memory usage, but also allows us to take advantage of single-instruction-multiple-data\n(SIMD) instructions for integers on modern CPUs.\n\nFigure 1: Experiments compare the training loss, performance, and convergence of HOGWILD! and\nBUCKWILD! algorithms with sequential and/or high-precision versions.\n(a) Speedup of BUCKWILD! for dense RCV1 dataset. Panel title: Performance of BUCKWILD! for Logistic Regression. x-axis: threads; left y-axis: speedup over 32-bit sequential; right y-axis: speedup over 32-bit best HOGWILD!; series: 32-bit float, 16-bit int, 8-bit int.\n(b) Convergence trajectories for sequential versus HOGWILD! Alecton. Panel title: Hogwild vs. Sequential Alecton for n = 10^6. x-axis: iterations (billions); y-axis: (u_1^T x)^2 \u2016x\u2016^{\u22122}; series: sequential, 12-thread hogwild.\n\nWe verified our main claims by running HOGWILD! and BUCKWILD! algorithms on the discussed\napplications. Table 1 shows how the training loss of SGD for logistic regression, a convex problem,\nvaries as the precision is changed. We ran SGD with step size \u03b1 = 0.0001; however, results are\nsimilar across a range of step sizes. We analyzed all four datasets reported in DimmWitted [25] that\nfavored HOGWILD!: Reuters and RCV1, which are text classification datasets; Forest, which arises\nfrom remote sensing; and Music, which is a music classification dataset. We implemented all GLM\nmodels reported in DimmWitted, including SVM, Linear Regression, and Logistic Regression, and\nreport Logistic Regression because other models have similar performance. 
The results illustrate\nthat there is almost no increase in training loss as the precision is decreased for these problems. We\nalso investigated 4-bit and 1-bit computation: the former was slower than 8-bit due to a lack of 4-bit\nSIMD instructions, and the latter discarded too much information to produce good-quality results.\nFigure 1(a) displays the speedup of BUCKWILD! running on the dense version of the RCV1 dataset\ncompared to both full-precision sequential SGD (left axis) and best-case HOGWILD! (right axis).\nExperiments ran on a machine with two Xeon X650 CPUs, each with six hyperthreaded cores, and\n24GB of RAM. This plot illustrates that incorporating low-precision arithmetic into our algorithm\nallows us to achieve significant speedups over both sequential and HOGWILD! SGD. (Note that we\ndon\u2019t get full linear speedup because we are bound by the available memory bandwidth; beyond\nthis limit, adding additional threads provides no benefits while increasing conflicts and thrashing\nthe L1 and L2 caches.) This result, combined with the data in Table 1, suggests that by doing\nlow-precision asynchronous updates, we can get speedups of up to 2.3\u00d7 on these sorts of datasets without\na significant increase in error.\nFigure 1(b) compares the convergence trajectories of HOGWILD! and sequential versions of the\nnon-convex Alecton matrix completion algorithm on a synthetic data matrix A \u2208 R^{n\u00d7n} with ten random\neigenvalues \u03bbi > 0. Each plotted series represents a different run of Alecton; the trajectories differ\nsomewhat because of the randomness of the algorithm. The plot shows that the sequential and\nasynchronous versions behave qualitatively similarly, and converge to the same noise floor. For this\ndataset, sequential Alecton took 6.86 seconds to run while 12-thread HOGWILD! 
Alecton took 1.39\nseconds, a 4.9\u00d7 speedup.\n\n5 Conclusion\n\nThis paper presented a uni\ufb01ed theoretical framework for producing results about the convergence\nrates of asynchronous and low-precision random algorithms such as stochastic gradient descent. We\nshowed how a martingale-based rate of convergence for a sequential, full-precision algorithm can\nbe easily leveraged to give a rate for an asynchronous, low-precision version. We also introduced\nBUCKWILD!, a strategy for SGD that is able to take advantage of modern hardware resources for\nboth task and data parallelism, and showed that it achieves near linear parallel speedup over sequen-\ntial algorithms.\n\nAcknowledgments\n\nThe BUCKWILD! name arose out of conversations with Benjamin Recht. Thanks also to Madeleine Udell\nfor helpful conversations. The authors acknowledge the support of: DARPA FA8750-12-2-0335; NSF IIS-\n1247701; NSF CCF-1111943; DOE 108845; NSF CCF-1337375; DARPA FA8750-13-2-0039; NSF IIS-\n1353606; ONR N000141210041 and N000141310129; NIH U54EB020405; Oracle; NVIDIA; Huawei; SAP\nLabs; Sloan Research Fellowship; Moore Foundation; American Family Insurance; Google; and Toshiba.\n\n8\n\n\fReferences\n[1] L\u00b4eon Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT\u20192010, pages\n\n177\u2013186. Springer, 2010.\n\n[2] L\u00b4eon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421\u2013436.\n\nSpringer, 2012.\n\n[3] L\u00b4eon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer,\n\nand S. Roweis, editors, NIPS, volume 20, pages 161\u2013168. NIPS Foundation, 2008.\n\n[4] Christopher De Sa, Kunle Olukotun, and Christopher R\u00b4e. Global convergence of stochastic gradient\n\ndescent for some nonconvex matrix problems. ICML, 2015.\n\n[5] John C Duchi, Peter L Bartlett, and Martin J Wainwright. Randomized smoothing for stochastic opti-\n\nmization. 
SIAM Journal on Optimization, 22(2):674\u2013701, 2012.\n\n[6] Olivier Fercoq and Peter Richt\u00e1rik. Accelerated, parallel and proximal coordinate descent. arXiv preprint arXiv:1312.5799, 2013.\n\n[7] Thomas R Fleming and David P Harrington. Counting processes and survival analysis. volume 169, pages 56\u201357. John Wiley & Sons, 1991.\n\n[8] Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, and Reza Zadeh. WTF: The who to follow service at twitter. WWW \u201913, pages 505\u2013514, 2013.\n\n[9] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. ICML, 2015.\n\n[10] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, pages 665\u2013674. ACM, 2013.\n\n[11] Bj\u00f6rn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157\u20131170, 2009.\n\n[12] Jakub Kone\u010dn\u00fd, Zheng Qu, and Peter Richt\u00e1rik. S2CD: Semi-stochastic coordinate descent. In NIPS Optimization in Machine Learning workshop, 2014.\n\n[13] Yann Le Cun, L\u00e9on Bottou, Genevieve B. Orr, and Klaus-Robert M\u00fcller. Efficient backprop. In Neural Networks, Tricks of the Trade. 1998.\n\n[14] Ji Liu and Stephen J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIOPT, 25(1):351\u2013376, 2015.\n\n[15] Ji Liu, Stephen J Wright, Christopher R\u00e9, Victor Bittorf, and Srikrishna Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. JMLR, 16:285\u2013322, 2015.\n\n[16] Ioannis Mitliagkas, Michael Borokhovich, Alexandros G. Dimakis, and Constantine Caramanis. Frogwild!: Fast pagerank approximations on graph engines. PVLDB, 2015.\n\n[17] Feng Niu, Benjamin Recht, Christopher R\u00e9, and Stephen Wright. 
Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693\u2013701, 2011.\n\n[18] Cyprien Noel and Simon Osindero. Dogwild!\u2013Distributed Hogwild for CPU & GPU. 2014.\n\n[19] Shameem Ahamed Puthiya Parambath. Matrix factorization methods for recommender systems. 2013.\n\n[20] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. ICML, 2012.\n\n[21] Peter Richt\u00e1rik and Martin Tak\u00e1\u010d. Parallel coordinate descent methods for big data optimization. Mathematical Programming, pages 1\u201352, 2012.\n\n[22] Qing Tao, Kang Kong, Dejun Chu, and Gaowei Wu. Stochastic coordinate descent methods for regularized smooth and nonsmooth losses. In Machine Learning and Knowledge Discovery in Databases, pages 537\u2013552. Springer, 2012.\n\n[23] Rachael Tappenden, Martin Tak\u00e1\u010d, and Peter Richt\u00e1rik. On the complexity of parallel coordinate descent. arXiv preprint arXiv:1503.03033, 2015.\n\n[24] Hsiang-Fu Yu, Cho-Jui Hsieh, Si Si, and Inderjit S Dhillon. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In ICDM, pages 765\u2013774, 2012.\n\n[25] Ce Zhang and Christopher R\u00e9. Dimmwitted: A study of main-memory statistical analytics. PVLDB, 2014.\n", "award": [], "sourceid": 1554, "authors": [{"given_name": "Christopher", "family_name": "De Sa", "institution": "Stanford"}, {"given_name": "Ce", "family_name": "Zhang", "institution": "Wisconsin"}, {"given_name": "Kunle", "family_name": "Olukotun", "institution": "Stanford"}, {"given_name": "Christopher", "family_name": "R\u00e9", "institution": "Stanford"}, {"given_name": "Christopher", "family_name": "R\u00e9", "institution": null}]}