{"title": "Cyclades: Conflict-free Asynchronous Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2568, "page_last": 2576, "abstract": "We present Cyclades, a general framework for parallelizing stochastic optimization algorithms in a shared memory setting. Cyclades is asynchronous during model updates, and requires no memory locking mechanisms, similar to Hogwild!-type algorithms. Unlike Hogwild!, Cyclades introduces no conflicts during parallel execution, and offers a black-box analysis for provable speedups across a large family of algorithms. Due to its inherent cache locality and conflict-free nature, our multi-core implementation of Cyclades consistently outperforms Hogwild!-type algorithms on sufficiently sparse datasets, leading to up to 40% speedup gains compared to Hogwild!, and up to 5\\times gains over asynchronous implementations of variance reduction algorithms.", "full_text": "Con\ufb02ict-free Asynchronous Machine Learning\n\nCYCLADES:\n\nXinghao Pan\u21e4, Maximilian Lam\u21e4,\n\nStephen Tu\u21e4, Dimitris Papailiopoulos\u21e4,\n\nCe Zhang\u2020, Michael I. Jordan\u21e4\u2021, Kannan Ramchandran\u21e4, Chris Re\u2020, Benjamin Recht\u21e4\u2021\n\nAbstract\n\nWe present CYCLADES, a general framework for parallelizing stochastic optimiza-\ntion algorithms in a shared memory setting. CYCLADES is asynchronous during\nmodel updates, and requires no memory locking mechanisms, similar to HOG-\nWILD!-type algorithms. Unlike HOGWILD!, CYCLADES introduces no con\ufb02icts\nduring parallel execution, and offers a black-box analysis for provable speedups\nacross a large family of algorithms. 
Due to its inherent cache locality and conflict-free nature, our multi-core implementation of CYCLADES consistently outperforms HOGWILD!-type algorithms on sufficiently sparse datasets, leading to up to 40% speedup gains compared to HOGWILD!, and up to 5x gains over asynchronous implementations of variance reduction algorithms.

1 Introduction

Following the seminal work of HOGWILD! [17], many studies have demonstrated that near-linear speedups are achievable on several machine learning tasks via asynchronous, lock-free implementations [25, 13, 8, 16]. In all of these studies, classic algorithms are parallelized by simply running parallel and asynchronous model updates without locks. These lock-free, asynchronous algorithms exhibit speedups even when applied to large, non-convex problems, as demonstrated by deep learning systems such as Google's Downpour SGD [6] and Microsoft's Project Adam [4]. While these techniques have been remarkably successful, many of the above papers require delicate and tailored analyses to quantify the benefits of asynchrony for each particular learning task. Moreover, in non-convex settings, we currently have little quantitative insight into how much speedup is gained from asynchrony.

In this work, we present CYCLADES, a general framework for lock-free, asynchronous machine learning algorithms that obviates the need for specialized analyses. CYCLADES runs asynchronously and maintains serial equivalence, i.e., it produces the same outcome as the serial algorithm. Since it returns the same output as a serial implementation, any algorithm parallelized by our framework inherits the correctness proof of its serial counterpart without modifications.
Additionally, if a particular serially run heuristic is popular but does not have a rigorous analysis, CYCLADES still guarantees that its execution will return a serially equivalent output.

CYCLADES achieves serial equivalence by partitioning updates among cores, in a way that ensures that there are no conflicts across partitions. Such a partition can always be found efficiently by leveraging a powerful result on graph phase transitions [12]. When applied to our setting, this result guarantees that a sufficiently small sample of updates will have only a logarithmic number of conflicts. This allows us to evenly partition model updates across cores, with the guarantee that all conflicts are localized within each core. Given enough problem sparsity, CYCLADES guarantees a nearly linear speedup, while inheriting all the qualitative properties of the serial counterpart of the algorithm, e.g., proofs for rates of convergence.

* Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA.
† Department of Computer Science, Stanford University, Palo Alto, CA.
‡ Department of Statistics, UC Berkeley, Berkeley, CA.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Enforcing a serially equivalent execution in CYCLADES comes with additional practical benefits. Serial equivalence is helpful for hyperparameter tuning, or for locating the best model produced by the asynchronous execution, since experiments are reproducible and solutions are easily verifiable. Moreover, a CYCLADES program is easy to debug, because bugs are repeatable and we can examine the step-wise execution to localize them.

A significant benefit of the update partitioning in CYCLADES is that it induces considerable access locality, compared to the more unstructured nature of the memory accesses during HOGWILD!.
Cores will access the same data points and read/write the same subset of model variables. This has the additional benefit of reducing false sharing across cores. Because of these gains, CYCLADES can actually outperform HOGWILD! in practice on sufficiently sparse problems, despite appearing to require more computational overhead. Remarkably, because of the added locality, even a single-threaded implementation of CYCLADES can actually be faster than serial SGD. In our SGD experiments for matrix completion and word embedding problems, CYCLADES can offer a speedup gain of up to 40% compared to that of HOGWILD!. Furthermore, for variance reduction techniques such as SAGA [7] and SVRG [11], CYCLADES yields better accuracy and more significant speedups, with up to 5x performance gains over HOGWILD!-type implementations.

2 The Algorithmic Family of Stochastic-Updates

We study parallel asynchronous iterative algorithms in the computational model used by [17]: several cores have access to the same shared memory, and each of them can read and update components of the shared memory. In this work, we consider a family of randomized algorithms that we refer to as Stochastic Updates (SU). The main algorithmic component of SU focuses on updating small subsets of a model variable $x$, according to prefixed access patterns, as sketched by Alg. 1.

Algorithm 1 Stochastic Updates
1: Input: $x$; $f_1, \dots, f_n$; $T$
2: for $t = 1 : T$ do
3:   sample $i \sim D$
4:   $x_{S_i} = u_i(x_{S_i}, f_i)$
5: Output: $x$

In Alg. 1, $S_i$ is a subset of the coordinates in $x$, each function $f_i$ operates on the subset $S_i$ of coordinates, and $u_i$ is a local update function that computes a vector with support on $S_i$, using as input $x_{S_i}$ and $f_i$. Moreover, $T$ is the total number of iterations, and $D$ is the distribution with support $\{1, \dots, n\}$ from which we draw $i$.
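As a concrete illustration, the SU template of Alg. 1 can be sketched in a few lines of Python. This is a minimal sketch of ours (the names stochastic_updates, S, u are illustrative, and this is not the paper's C++ implementation): the key point is that each iteration reads and writes only the coordinates in $S_i$.

```python
import random

def stochastic_updates(x, S, f, u, T, seed=0):
    # Serial Stochastic Updates (SU) loop in the spirit of Alg. 1.
    #   x : model, a list of floats
    #   S : S[i] lists the coordinates that update i reads/writes
    #   f : f[i] is the data/function associated with update i
    #   u : local update rule, u(x_Si, f_i) -> new values for S[i]
    rng = random.Random(seed)
    n = len(S)
    for _ in range(T):
        i = rng.randrange(n)                  # sample i ~ D (uniform here)
        new_vals = u([x[j] for j in S[i]], f[i])
        for j, v in zip(S[i], new_vals):      # write back only on S_i
            x[j] = v
    return x
```

For instance, plain SGD is recovered by letting u return $x_{S_i} - \gamma \nabla f_i(x_{S_i})$.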
Several machine learning algorithms belong to the SU algorithmic family, such as stochastic gradient descent (SGD), with or without weight decay and regularization, variance-reduced learning algorithms like SAGA and SVRG, and even some combinatorial graph algorithms. In our supplemental material, we explain how these algorithms can be phrased in the SU language.

The updates conflict graph. A useful construct for our developments is the conflict graph between updates, which can be generated from the bipartite graph between the updates and the model variables. We define these graphs below, and provide an illustrative sketch in Fig. 1.

Definition 1. Let $G_u$ denote the bipartite update-variable graph between the $n$ updates and the $d$ model variables. An update $u_i$ is linked to a variable $x_j$ if $u_i$ requires to read/write $x_j$. Let $E_u$ denote the number of edges in the bipartite graph, $\Delta_L$ the max left degree of $G_u$, and $\bar{\Delta}_L$ the average left degree. Finally, we denote by $G_c$ the conflict graph on the $n$ updates. Two vertices in $G_c$ are linked if the corresponding updates share at least one variable in $G_u$. We also denote by $\Delta$ the max vertex degree of $G_c$.

Figure 1: In the bipartite graph, an update $u_i$ is linked to variable $x_j$ when it needs to read/write it. From $G_u$ we obtain the conflict graph $G_c$, whose max degree is $\Delta$. If that is small, we expect that it is possible to parallelize updates without too many conflicts. CYCLADES exploits this intuition.

We stress that the conflict graph is never constructed, but is a useful concept for understanding CYCLADES.

Our Main Result. By exploiting the structure of the above graphs, and through a light-weight sampling and allocation of updates, CYCLADES guarantees the following result for SU algorithms, which we establish in the following sections.

Theorem 1 (informal).
Let an SU algorithm $A$ be defined through $n$ update rules, where the max conflict degree between the $n$ updates is $\Delta$, and the sampling distribution $D$ is uniform with (or without) replacement from $\{1, \dots, n\}$. Moreover, assume that we wish to run $A$ for $T = \Theta(n)$ iterations, and that $\Delta_L / \bar{\Delta}_L \le \sqrt{n}$. Then on up to $P = \tilde{O}(n / (\Delta \cdot \bar{\Delta}_L))$ cores, CYCLADES guarantees an $\tilde{\Omega}(P)$ speedup over $A$, while outputting the same solution $x$ as $A$ would do after the same random set of $T$ iterations.^4

We now provide two examples of how these guarantees translate for specific problem cases.

Example 1. In many applications we seek to minimize $\min_x \frac{1}{n}\sum_{i=1}^n \ell_i(a_i^T x)$, where $a_i$ represents the $i$-th data point, $x$ is the parameter vector, and $\ell_i$ is a loss. Several problems can be formulated in this way, such as logistic regression, least squares, binary classification, etc. If we tackle the above problem using SGD, or techniques like SVRG and SAGA, then (as we show in the supplemental) the update sparsity is determined by the gradient of a single sampled data point $a_i$. Here, we will have $\Delta_L = \max_i \|a_i\|_0$, and $\Delta$ will be equal to the maximum number of data points $a_i$ that share at least one feature. As a toy example, let $n/d = \Theta(1)$ and let the non-zero support of $a_i$ be of size $n^\delta$ and uniformly distributed. Then, one can show that with high probability $\Delta = \tilde{O}(n^{1/2+\delta})$, and hence CYCLADES achieves an $\tilde{\Omega}(P)$ speedup on up to $P = \tilde{O}(n^{1/2-2\delta})$ cores.

Example 2. Consider the generic optimization $\min_{x_i, y_j, i \in [m]} \sum_{i=1}^m \sum_{j=1}^m \phi_{i,j}(x_i, y_j)$, which captures several problems like matrix completion and factorization [17], word embeddings [2], graph $k$-way cuts [17], etc. Assume that we aim to minimize the above by sampling a single function $\phi_{i,j}$ and then updating $x_i$ and $y_j$ using SGD.
Here, the number of update functions is proportional to $n = m^2$, and each gradient update with respect to the sampled function $\phi_{i,j}(x_i, y_j)$ interacts only with the variables $x_i$ and $y_j$, i.e., only two variable vectors out of the $2m$ vectors (i.e., $\Delta_L = 2$). This also implies a conflict degree of at most $\Delta = 2m$. Here, CYCLADES can provably guarantee an $\tilde{\Omega}(P)$ speedup for up to $P = O(m)$ cores.

In our experiments we test CYCLADES on several problems including least squares, classification with logistic models, matrix factorization, and word embeddings, and several algorithms including SGD, SVRG, and SAGA. We show that in most cases it can significantly outperform the HOGWILD! implementation of these algorithms, if the data is sparse.

Remark 1. We would like to note that there are several cases where there might be a few outlier updates with extremely high conflict degree. In the supplemental material, we prove that if there are no more than $O(n^\delta)$ vertices of high conflict degree $\Delta_o$, and the rest of the vertices have max degree at most $\Delta$, then the result of Theorem 1 still holds in expectation.

In the following section, we establish the theory of CYCLADES and provide the details behind our parallelization framework.

3 CYCLADES: Shattering Dependencies

CYCLADES consists of three computational components, as shown in Fig. 2. It starts by sampling (according to a distribution $D$) a number of $B$ updates from the graph shown in Fig. 1, and assigns a label to each of them (a processing order). After sampling, it computes the connected components of the sampled subgraph induced by the $B$ sampled updates, to determine the conflict groups. Once the conflict groups are formed, it allocates them across $P$ cores. Finally, each core locally processes the conflict groups of updates it has been assigned, following the order with which each update has been labeled. The above process is then repeated for as many iterations as needed.
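The three phases above can be made concrete with a toy sketch. The following is our own illustrative Python (not the paper's C++ code): it groups the conflicting updates of a sampled batch via union-find over shared variables, allocates groups across simulated cores, and processes each group in sampled order; because every conflict is contained inside one group, the output matches the serial execution of the same sampled sequence.

```python
from collections import defaultdict

def conflict_groups(batch, S):
    # Union-find over the sampled updates: two updates are joined
    # whenever they read/write a common variable (an edge of G_c).
    parent = {i: i for i in batch}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    by_var = defaultdict(list)
    for i in batch:
        for v in S[i]:
            by_var[v].append(i)
    for ids in by_var.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])
    groups = defaultdict(list)
    for i in batch:              # batch order = serial sampling order
        groups[find(i)].append(i)
    return list(groups.values())

def cyclades_batch(x, S, f, u, batch, P=4):
    # Allocate conflict groups across P simulated cores (heaviest group
    # first, onto the least-loaded core), then process each group's
    # updates in sampled order. Groups are variable-disjoint, so cores
    # never coordinate, and the result is serially equivalent.
    cores = [[] for _ in range(P)]
    for g in sorted(conflict_groups(batch, S), key=len, reverse=True):
        min(cores, key=lambda c: sum(len(h) for h in c)).append(g)
    for core in cores:           # conceptually: one thread per core
        for g in core:
            for i in g:
                vals = u([x[j] for j in S[i]], f[i])
                for j, v in zip(S[i], vals):
                    x[j] = v
    return x
```

In a real implementation the per-core loop runs asynchronously on separate threads; iterating it serially here is enough to check serial equivalence.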
The key component of CYCLADES is to carry out the sampling in such a way that we provably obtain as many connected components as possible, all of them of small size. In the next subsections, we explain how each part is carried out, and provide theoretical guarantees for each of them individually, which we combine at the end of this section for our main theorem.

^4 $\tilde{\Omega}(\cdot)$ and $\tilde{O}(\cdot)$ hide polylog factors.

Figure 2: CYCLADES samples updates, finds conflict-groups, and allocates them across cores. Each core asynchronously updates the model, without access conflicts. This is possible by processing the conflicting updates within the same core.

A key technical aspect that we exploit in CYCLADES is that appropriate sampling and allocation of updates can lead to near optimal parallelization of SU algorithms. To do that we expand upon the following result established in [12].

Theorem 2. Let $G$ be a graph on $n$ vertices, with max degree $\Delta$. Let us sample each vertex independently with probability $p = \frac{1-\epsilon}{\Delta}$ and define as $G'$ the induced subgraph on the sampled vertices. Then, the largest connected component of $G'$ has size at most $\frac{4}{\epsilon^2}\log n$, with high probability.

The above result pays homage to the giant component phase transition phenomena in random Erdos-Renyi graphs. What is surprising is that similar phase transitions apply to any given graph!

In practice, for most SU algorithms of interest, the sampling distribution of updates is either with or without replacement from the $n$ updates. As it turns out, morphing Theorem 2 into a with-/without-replacement result is not straightforward.
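As a quick empirical illustration of Theorem 2 (our own toy check, not from the paper), consider a ring graph, where $\Delta = 2$: sampling vertices with $p = (1-\epsilon)/\Delta$ leaves only short runs of consecutive sampled vertices, i.e., only small induced connected components.

```python
import random

def largest_sampled_cc_on_ring(n, eps, seed=0):
    # Toy check of Theorem 2 on a ring graph (max degree Delta = 2):
    # keep each vertex independently with p = (1 - eps) / Delta; an
    # induced connected component is a run of consecutive kept vertices.
    rng = random.Random(seed)
    p = (1.0 - eps) / 2.0
    kept = [rng.random() < p for _ in range(n)]
    best = run = 0
    for k in kept + kept:        # doubling accounts for wrap-around runs
        run = run + 1 if k else 0
        best = max(best, run)
    return min(best, n)
```

With n = 1000 and eps = 0.5 the largest surviving component is a handful of vertices, in line with the $O(\log n / \epsilon^2)$ bound, while a quarter of the vertices are still sampled.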
We defer the needed analysis to the supplemental material, and present our main theorem about graph sampling here.

Theorem 3. Let $G$ be a graph on $n$ vertices, with max degree $\Delta$. Let us sample $B = (1-\epsilon)\frac{n}{\Delta}$ vertices with or without replacement, and define as $G'$ the induced subgraph on the sampled vertices. Then, the largest connected component of $G'$ has size at most $O(\frac{\log n}{\epsilon^2})$, with high probability.

The key idea from the above is that if one samples no more than $B = (1-\epsilon)\frac{n}{\Delta}$ updates, then there will be at least $O(\epsilon^2 B / \log n)$ conflict groups to allocate across cores, each of size at most $O(\log n / \epsilon^2)$. Since there are no conflicts between different conflict-groups, the processing of updates in any single group will never interact with the variables corresponding to the updates of another conflict group.

The next step of CYCLADES is to form and allocate the connected components (CCs) across cores, efficiently. We address this in the following subsection. In the following, for brevity we focus on the with-replacement sampling case, but the results can be extended to the without-replacement case.

Identifying groups of conflict. In CYCLADES, we sample batches of updates of size $B$ multiple times, and for each batch we need to identify the conflict groups across the updates. Let us refer to $G^i_u$ as the subgraph induced by the $i$-th sampled batch of updates on the update-variable graph $G_u$. In the following we always assume that we sample $n_b = c \cdot \Delta/(1-\epsilon)$ batches, where $c \ge 1$ is a constant. This number of batches results in a constant number of passes over the dataset. Then, identifying the conflict groups in $G^i_u$ can be done with a connected components (CC) algorithm. The main question we need to address is what is the best way to parallelize this graph partitioning part.
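For intuition, the following toy simulation (ours; the random bipartite model and all numbers are illustrative) shows how a sampled batch shatters into many small conflict groups, computed with a simple BFS-based connected components pass over shared variables:

```python
import random
from collections import defaultdict, deque

def sampled_conflict_components(n, d, left_degree, B, seed=0):
    # Build a random update-variable bipartite graph (n updates, d
    # variables, each update touching left_degree random variables),
    # sample a batch of B updates, and return the sizes of the
    # conflict components induced by the batch.
    rng = random.Random(seed)
    S = [rng.sample(range(d), left_degree) for _ in range(n)]
    batch = rng.sample(range(n), B)
    by_var = defaultdict(list)
    for i in batch:
        for v in S[i]:
            by_var[v].append(i)
    seen, sizes = set(), []
    for start in batch:
        if start in seen:
            continue
        seen.add(start)
        size, queue = 0, deque([start])
        while queue:              # BFS over updates sharing a variable
            i = queue.popleft()
            size += 1
            for v in S[i]:
                for j in by_var[v]:
                    if j not in seen:
                        seen.add(j)
                        queue.append(j)
        sizes.append(size)
    return sizes
```

On a sparse instance, a small batch yields mostly singleton components and no large ones, which is exactly the structure CYCLADES allocates across cores.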
In the supplemental, we provide the details of this part, and prove the following result:

Lemma 1. Let the number of cores be $P = O(\frac{n}{\Delta \cdot \bar{\Delta}_L})$, and let $\Delta_L / \bar{\Delta}_L \le \sqrt{n}$. Then, the overall computation of CCs for $n_b = c \cdot \frac{\Delta}{1-\epsilon}$ batches, each of size $B = (1-\epsilon)\frac{n}{\Delta}$, costs no more than $O(\frac{E_u}{P}\log^2 n)$.

Allocating updates to cores. Once we compute the CCs (i.e., the conflict groups of the sampled updates), we have to allocate them across cores. Once a core has been assigned some CCs, it will process the updates included in these CCs, according to the order with which each update has been labeled. Due to Theorem 3, each connected component will contain at most $O(\frac{\log n}{\epsilon^2})$ updates. Assuming that the cost of the $j$-th update in the batch is $w_j$, the cost of a single connected component $C$ will be $w_C = \sum_{j \in C} w_j$. To proceed with characterizing the maximum load among the $P$ cores, we assume that the cost of a single update $u_i$, for $i \in \{1, \dots, n\}$, is proportional to the out-degree of that update (according to the update-variable graph $G_u$) times a constant cost which we shall refer to as $\kappa$. Hence, $w_j = O(d_{L,j} \cdot \kappa)$, where $d_{L,j}$ is the degree of the $j$-th left vertex of $G_u$. In the supplemental material, we establish that a near-uniform allocation of CCs according to their weights leads to the following guarantee.

Lemma 2. Let the number of cores be bounded as $P = O(\frac{n}{\Delta \cdot \bar{\Delta}_L})$, and let $\Delta_L / \bar{\Delta}_L \le \sqrt{n}$. Then, computing the stochastic updates across all $n_b = c \cdot \frac{\Delta}{1-\epsilon}$ batches can be performed in time $O(\frac{E_u \log^2 n}{P} \cdot \kappa)$, with high probability, where $\kappa$ is the per-edge cost for computing one of the $n$ updates defined on $G_u$.

Stitching the pieces together. Now that we have described the sampling, conflict computation, and allocation strategies, we are ready to put all the pieces together and detail CYCLADES in full. Let us assume that we sample a total number of $n_b = c \cdot \frac{\Delta}{1-\epsilon}$ batches of size $B = (1-\epsilon)\frac{n}{\Delta}$, and that each update is sampled uniformly at random.
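The near-uniform weighted allocation can be approximated with the standard greedy longest-processing-time heuristic. This sketch is ours, written under our own simplifications (the paper's exact allocation scheme is in its supplemental material):

```python
def allocate(weights, P):
    # Greedy near-uniform allocation (longest-processing-time first):
    # the heaviest remaining conflict group goes to the currently
    # least-loaded core; weights[c] plays the role of w_C.
    loads = [0.0] * P
    assign = [[] for _ in range(P)]
    for c in sorted(range(len(weights)), key=lambda c: -weights[c]):
        k = loads.index(min(loads))
        assign[k].append(c)
        loads[k] += weights[c]
    return assign, loads
```

Because every component weighs at most $O(\log n / \epsilon^2)$ times the per-edge cost, this greedy pass keeps the maximum core load close to the average.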
For the $i$-th batch, let us denote by $C^i_1, \dots, C^i_{m_i}$ the connected components on the induced subgraph $G^i_u$. Due to Theorem 3, each connected component $C$ contains at most $O(\frac{\log n}{\epsilon^2})$ updates; each update carries an ID (the order with which it would have been sampled by the serial algorithm). Using the above notation, we give the pseudocode for CYCLADES in Alg. 2.

Algorithm 2 CYCLADES
1: Input: $G_u$, $n_b$
2: Sample $n_b$ subgraphs $G^1_u, \dots, G^{n_b}_u$ from $G_u$
3: Compute in parallel CCs for sampled graphs
4: for batch $i = 1 : n_b$ do
5:   Allocation of $C^i_1, \dots, C^i_{m_i}$ to $P$ cores
6:   for each core in parallel do
7:     for each allocated component $C$ do
8:       for each ordered update $j$ from $C$ do
9:         $x_{S_j} = u_j(x_{S_j}, f_j)$
10: Output: $x$

Note that the inner loop that is parallelized (i.e., the SU processing loop in lines 6-9) can be performed asynchronously; cores do not have to synchronize, and do not need to lock any memory variables, as they are all accessing non-overlapping subsets of $x$. This also provides for better cache coherence. Moreover, each core potentially accesses the same coordinates several times, leading to good cache locality. These improved cache locality and coherence properties experimentally lead to substantial performance gains, as we see in the next section. We can now combine the results of the previous subsections to obtain our main theorem for CYCLADES.

Theorem 4. Let us assume any given update-variable graph $G_u$ with max and average left degrees $\Delta_L$ and $\bar{\Delta}_L$, such that $\Delta_L / \bar{\Delta}_L \le \sqrt{n}$, and with induced max conflict degree $\Delta$. Then, CYCLADES on $P = O(\frac{n}{\Delta \cdot \bar{\Delta}_L})$ cores, with batch sizes $B = (1-\epsilon)\frac{n}{\Delta}$, can execute $T = c \cdot n$ updates, for any constant $c \ge 1$, selected uniformly at random with replacement, in time $O(\frac{E_u \cdot \kappa}{P} \cdot \log^2 n)$, with high probability.

Observe that CYCLADES bypasses the need to establish convergence guarantees for the parallel algorithm.
Hence, it could be the case for an application of interest that we cannot analyze how a serial SU algorithm performs in terms of, say, the accuracy of the solution, but CYCLADES can still provide black-box guarantees for speedup, since our analysis is completely oblivious to the qualitative performance of the serial algorithm. This is in contrast to recent studies similar to [5], where the authors provide speedup guarantees via a convergence-to-optimal proof for an asynchronous SGD on a nonconvex problem. Unfortunately these proofs can become complicated on a wider range of nonconvex objectives.

In the following section we show that CYCLADES is not only useful theoretically, but can consistently outperform HOGWILD! on sufficiently sparse datasets.

4 Evaluation

We implemented CYCLADES^5 in C++ and tested it on a variety of problems and a number of stochastic updates algorithms, and compared against their HOGWILD! (i.e., asynchronous, lock-free) implementations. Since CYCLADES is intended to be a general SU parallelization framework, we do not compare against algorithms tailored to specific applications, nor do we expect CYCLADES to outperform every such highly-tuned, well-designed, specific algorithm. Our experiments were conducted on a machine with 72 CPUs (Intel(R) Xeon(R) CPU E7-8870 v3, 2.10 GHz) on 4 NUMA nodes, each with 18 CPUs, and 1TB of memory. We ran CYCLADES and HOGWILD! with 1, 4, 8, 16 and 18 threads pinned to CPUs on a single NUMA node (i.e., the maximum physical number of cores per single node), to avoid well-known cache coherence and scaling issues across nodes [24].

^5 Code is available at https://github.com/amplab/cyclades.

Table 1: Details of datasets used in our experiments.

Dataset   | # datapoints | # features | av. sparsity / datapoint | Comments
NH2010    | 48,838       | 48,838     | 4.8026                   | Topological graph
DBLP      | 5,425,964    | 5,425,964  | 3.1880                   | Authorship network
MovieLens | ~10M         | 82,250     | 200                      | 10M movie ratings
EN-Wiki   | 20,207,156   | 213,272    | 200                      | Subset of English Wikipedia dump

In our experiments, we measure overall running times, which include the overheads for computing connected components and allocating work in CYCLADES. We also compute the objective value at the end of each epoch (i.e., one full pass over the data). We measure the speedup of each algorithm as the time for the serial algorithm to reach the $\epsilon$ objective divided by the time for the parallel algorithm to reach the $\epsilon$ objective, where $\epsilon$ was chosen to be the smallest objective value that is achievable by all parallel algorithms on every choice of number of threads. The serial algorithm used for comparison is HOGWILD! running serially on one thread. In Table 1 we list some details of the datasets that we use in our experiments. We tune our constant stepsizes so as to maximize convergence without diverging, and use one random data reshuffling across all epochs. Batch sizes are picked to optimize performance for CYCLADES.

Figure 3: Convergence of CYCLADES and HOGWILD! in terms of overall running time with 1, 8, 16, 18 threads. Panels: (a) Least Sq., DBLP, SAGA; (b) Graph Eig., nh2010, SVRG; (c) Mat. Comp., 10M, l2-SGD; (d) Word2Vec, EN-Wiki, SGD. CYCLADES is initially slower, but ultimately reaches convergence faster than HOGWILD!.

Figure 4: Speedup of CYCLADES and HOGWILD! versus number of threads, for the same four problems as in Fig. 3. On multiple threads, CYCLADES always reaches the $\epsilon$ objective faster than HOGWILD!. In some cases CYCLADES is faster than HOGWILD! even on 1 thread, due to better cache locality. In Figs. 4(a) and 4(b), CYCLADES exhibits significant gains since HOGWILD!
suffers from asynchrony noise, and we had to use comparatively smaller stepsizes to prevent it from diverging.

Least squares via SAGA. The first problem we consider is least squares: $\min_x \frac{1}{n}\sum_{i=1}^n (a_i^T x - b_i)^2$, which we solve using the SAGA algorithm [7], an incremental gradient algorithm with faster rates than SGD on convex, or strongly convex functions. In SAGA, we initialize $g_i = \nabla f_i(x_0)$ and iterate the following two steps: $x_{k+1} = x_k - \gamma \cdot (\nabla f_{s_k}(x_k) - g_{s_k} + \frac{1}{n}\sum_{i=1}^n g_i)$ and $g_{s_k} = \nabla f_{s_k}(x_k)$, where $f_i(x) = (a_i^T x - b_i)^2$. In the above iteration it is useful to observe that the updates can be performed in a sparse and lazy way, as we explain in detail in our supplemental material. The stepsizes chosen for each of CYCLADES and HOGWILD! were the largest such that the algorithms did not diverge. We used the DBLP and NH2010 datasets for this experiment, and set $A$ as the adjacency matrix of each graph. For NH2010, the values of $b$ were set to the population living in each Census Block. For DBLP we used synthetic values: we set $b = A\tilde{x} + 0.1\tilde{z}$, where $\tilde{x}$ and $\tilde{z}$ were generated randomly. The SAGA algorithm was run for 500 epochs for each dataset. When running SAGA for least squares, we found that HOGWILD! was divergent with the large stepsizes that we were using for CYCLADES (Fig. 5). Thus, in the multi-thread setting, we were only able to use smaller stepsizes for HOGWILD!, which resulted in slower convergence than CYCLADES, as seen in Fig. 3(a). The effects of a smaller stepsize for HOGWILD! are also manifested in terms of speedups in Fig. 4(a), since HOGWILD! takes a longer time to converge to the $\epsilon$ objective value.

Figure 5: Convergence of CYCLADES and HOGWILD! on least squares using SAGA, with 16 threads, on the DBLP dataset. HOGWILD! diverges with $\gamma > 10^{-5}$; thus, we were only able to use a smaller step size $\gamma = 10^{-5}$ for HOGWILD! on multiple threads. For HOGWILD!
on 1 thread (and CYCLADES on any number of threads), we could use a larger stepsize of $\gamma = 3 \times 10^{-4}$.

Graph eigenvector via SVRG. Given an adjacency matrix $A$, the top eigenvector of $A^T A$ is useful in several applications such as spectral clustering, principal component analysis, and others. In a recent work, [10] proposes an algorithm for computing the top eigenvector of $A^T A$ by running intermediate SVRG steps to approximate the shift-and-invert iteration. Specifically, at each step SVRG is used to solve: $\min_x \sum_{i=1}^n \frac{1}{2}x^T(\frac{\lambda}{n}I - a_i a_i^T)x - \frac{1}{n}b^T x$, where $a_i$ is the $i$-th column of $A$. According to [10], if we initialize $y = x_0$ and assume $\|a_i\| = 1$, we have to iterate the following updates: $x_{k+1} = x_k - \gamma \cdot n \cdot (\nabla f_{s_k}(x_k) - \nabla f_{s_k}(y)) + \gamma \cdot \nabla f(y)$, where after every $T$ iterations we update $y = x_k$, and the stochastic gradients are of the form $\nabla f_i(x) = (\frac{\lambda}{n}I - a_i a_i^T)x - \frac{1}{n}b$.

We apply CYCLADES to the above SVRG iteration (see supplemental) for parallelizing this problem. We run experiments on two graphs: DBLP and NH2010. We ran SVRG for 50 and 100 epochs for NH2010 and DBLP respectively. The convergence of SVRG for graph eigenvectors is shown in Fig. 3(b). CYCLADES starts off slower than HOGWILD!, but always produces results equivalent to the convergence on a single thread. HOGWILD! does not exhibit the same behavior on multiple threads as it does serially; asynchrony causes HOGWILD! to converge slower on multiple threads. This effect is clearly seen in Fig. 4(b), where HOGWILD! fails to converge faster than the serial counterpart, and CYCLADES attains a significantly better speedup on 16 threads.

Matrix completion and word embeddings via SGD. In matrix completion we are given a partially observed matrix $M$, and wish to factorize it as $M \approx UV$, where $U$ and $V$ are low rank matrices with dimensions $n \times r$ and $r \times m$ respectively.
This may be achieved by optimizing $\min_{U,V} \sum_{(i,j) \in \Omega} (M_{i,j} - U_{i,\cdot}V_{\cdot,j})^2 + \frac{\lambda}{2}(\|U\|_F^2 + \|V\|_F^2)$, where $\Omega$ is the set of observed entries, which can be approximated by SGD on the observed samples. The regularized objective can be optimized by weighted SGD. In our experiments, we chose a rank of $r = 100$, and ran SGD and weighted SGD for 200 epochs. We used the MovieLens 10M dataset containing 10M ratings for 10K movies by 72K users.

Our second task that uses SGD is word embeddings, which aim to represent the meaning of a word $w$ via a vector $v_w \in R^d$. A recent work by [2] proposes to solve: $\min_{\{v_w\},C} \sum_{w,w'} A_{w,w'}(\log(A_{w,w'}) - \|v_w + v_{w'}\|_2^2 - C)^2$, where $A_{w,w'}$ is the number of times words $w$ and $w'$ co-occur within $\tau$ words in the corpus. In our experiments we set $\tau = 10$, following the suggested recipe of the aforementioned paper. We can approximate the solution to the above problem by SGD: we repeatedly sample entries $A_{w,w'}$ from $A$ and update the corresponding vectors $v_w, v_{w'}$. Then, at the end of each full pass over the data, we update the constant $C$ by its locally optimal value, which can be calculated in closed form. In our experiments, we optimized for a word embedding of dimension $d = 100$, and tested on an 80MB subset of the English Wikipedia dump. For our experiments, we ran SGD for 200 epochs.

Figs. 3(c) and 3(d) show the convergence for the matrix completion and word embedding problems. CYCLADES is initially slower than HOGWILD! due to the overhead of computing connected components. However, due to better cache locality and convergence properties, CYCLADES is able to reach a lower objective value in less time than HOGWILD!. In fact, we observe that CYCLADES is faster than HOGWILD! when both are run serially, demonstrating that the gains from (temporal) cache locality outweigh the coordination overhead of CYCLADES. These results are reflected in the speedups of CYCLADES and HOGWILD! (Figs. 4(c) and 4(d)).
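The sparse-access structure that CYCLADES exploits here ($\Delta_L = 2$, as in Example 2) is visible in a plain SGD sketch for the factorization objective above. This is our own illustrative code (rank, stepsize and names are ours, not the paper's implementation); the plain unregularized step is shown, with the weighted regularized variant indicated in a comment.

```python
import random

def sgd_matrix_completion(entries, n, m, r=2, gamma=0.05, epochs=200, seed=0):
    # entries: list of observed (i, j, value) triples from M.
    # Factor M ~= U V with U (n x r) and V (r x m). Each SGD step on an
    # observed entry (i, j) reads/writes only row U[i] and column j of V,
    # i.e., two variable vectors out of n + m.
    rng = random.Random(seed)
    U = [[rng.uniform(-0.5, 0.5) for _ in range(r)] for _ in range(n)]
    V = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(r)]
    for _ in range(epochs):
        for i, j, val in entries:
            err = val - sum(U[i][k] * V[k][j] for k in range(r))
            for k in range(r):
                u_ik = U[i][k]
                U[i][k] += gamma * err * V[k][j]
                V[k][j] += gamma * err * u_ik
                # a weighted, regularized step would also subtract
                # gamma * lam * U[i][k] and gamma * lam * V[k][j] here
    return U, V
```

On a tiny fully observed rank-1 matrix, the factors recover the observed entries to small error after a few hundred epochs.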
CYCLADES consistently achieves a better speedup (up to 11× on 18 threads) than HOGWILD! (up to 9× on 18 threads).
Partitioning and allocation costs5 The cost of partitioning and allocation for CYCLADES is given in Table 2, relative to the time that HOGWILD! takes to complete a single pass over the dataset. For matrix completion and the graph eigenvector problem, on 18 threads, CYCLADES takes the equivalent of 4-6 HOGWILD! epochs to complete its partitioning, as the problem is either very sparse or the updates are expensive. For solving least squares using SAGA and word embeddings using SGD, the cost of partitioning is equivalent to 11-14 HOGWILD! epochs on 18 threads. However, we point out that partitioning and allocation is a one-time cost which becomes cheaper with more stochastic update epochs. Additionally, this cost can be amortized across the extra experiments one has to run for hyperparameter tuning, since the graph partitioning is identical across the different stepsizes one might want to test.

5 It has come to our attention post submission that parts of our partitioning and allocation code could be further parallelized. We refer the reader to our arXiv paper 1605.09721 for the latest results.

# threads   Least Squares    Graph Eig.       Mat. Comp.            Word2Vec
            (SAGA, DBLP)     (SVRG, NH2010)   (ℓ2-SGD, MovieLens)   (SGD, EN-Wiki)
1             2.2245           0.9039           0.5507                0.5299
18           14.1792           4.7639           5.5270                3.9362

Table 2: Ratio of the time that CYCLADES consumes for partitioning and allocation to the time that HOGWILD! takes for one full pass over the dataset. On 18 threads, CYCLADES takes between 4-14 HOGWILD! epochs to perform partitioning. Note, however, that this computational effort is required only once per dataset.

Figure 6: Speedups of CYCLADES and HOGWILD! on 16 threads, for different percentages of dense features filtered. When only a very small number of features is filtered, CYCLADES is almost serial. However, as we increase the percentage from 0.016% to 0.048%, the speedup of CYCLADES improves and almost catches up with HOGWILD!.

Binary classification and dense coordinates Here we explore settings where CYCLADES is expected to perform poorly due to the inherent density of updates (i.e., for datasets with dense features). In particular, we test CYCLADES on a classification problem for text-based data. Specifically, we run classification on the URL dataset [15], which contains ∼2.4M URLs, labeled as either benign or malicious, and 3.2M features, including a bag-of-words representation of tokens in the URL. For this classification task, we used a logistic regression model, trained using SGD. Owing to its power-law nature, the dataset consists of a small number of extremely dense features which occur in nearly all updates. Since CYCLADES explicitly avoids conflicts, its schedule of SGD updates leads to poor speedups. However, we observe that most conflicts are caused by a small percentage of the densest features. If these features are removed from the dataset, CYCLADES is able to obtain much better speedups. The speedups obtained by CYCLADES and HOGWILD! on 16 threads for different filtering percentages are shown in Figure 6. Full results of the experiment are presented in the supplemental material. CYCLADES fails to get much speedup when nearly all the features are used.
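The dense-feature filtering used in this experiment can be sketched as a simple preprocessing pass: count how often each feature appears across examples and drop the top fraction of the densest ones. The helper below is a hypothetical illustration of this step, not the paper's actual code.

```python
import math
from collections import Counter

def filter_densest_features(examples, pct):
    """Drop the densest `pct` fraction of feature ids.

    `examples` is a list of feature-index lists, one per training example
    (a sparse bag-of-words representation). Returns the filtered examples
    and the set of dropped feature ids.
    """
    counts = Counter(f for ex in examples for f in ex)
    ranked = [f for f, _ in counts.most_common()]   # densest features first
    n_drop = math.ceil(pct * len(ranked))
    dropped = set(ranked[:n_drop])
    kept = [[f for f in ex if f not in dropped] for ex in examples]
    return kept, dropped
```

Removing the densest columns shrinks the conflict graph's largest-degree vertices, which is what allows CYCLADES to recover parallelism on this dataset.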
However, as more dense features are removed, CYCLADES obtains a better speedup, almost equalling HOGWILD!'s speedup when 0.048% of the densest features are filtered.
5 Related work
The end of Moore's Law, coupled with recent advances in parallel and distributed computing technologies, has triggered renewed interest in parallel stochastic optimization [26, 9, 1, 22]. Much of this contemporary work builds upon the foundational work of Bertsekas, Tsitsiklis et al. [3, 23].
Inspired by HOGWILD!'s success at achieving nearly linear speedups for a variety of machine learning tasks, several authors developed other lock-free and asynchronous optimization algorithms, such as parallel stochastic coordinate descent [13]. Additional work in first-order optimization and beyond [8, 21, 5] has further demonstrated that linear speedups are generically possible in the asynchronous shared-memory setting.
Other machine learning algorithms have been parallelized using concurrency control, including non-parametric clustering [18], submodular maximization [19], and correlation clustering [20]. Sparse, graph-based parallel computation is supported by systems like GraphLab [14]. These frameworks require computation to be written in a specific programming model with associative, commutative operations. GraphLab and PowerGraph support serializable execution via locking mechanisms; this is in contrast to our partition-and-allocate coordination, which allows us to provide guarantees on speedup.
6 Conclusion
We presented CYCLADES, a general framework for lock-free parallelization of stochastic optimization algorithms that maintains serial equivalence. Our framework can be used to parallelize a large family of stochastic update algorithms in a conflict-free manner, thereby ensuring that the parallelized algorithm produces the same result as its serial counterpart.
Theoretical properties, such as convergence rates, are therefore preserved by the CYCLADES-parallelized algorithm, and we provide a single unified theoretical analysis that guarantees near linear speedups.
By eliminating conflicts across processors within each batch of updates, CYCLADES avoids all asynchrony errors and conflicts, and achieves better cache locality and cache coherence than HOGWILD!. These features translate to near linear speedups in practice, where CYCLADES can outperform HOGWILD!-type implementations by up to 5× in terms of speedup.
In the future, we intend to explore hybrids of CYCLADES with HOGWILD!, pushing the boundaries of what is possible in a shared-memory setting. We are also considering solutions for scaling out in a distributed setting, where the cost of communication is significantly higher.

References
[1] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In NIPS, pages 873–881, 2011.
[2] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski. Rand-walk: A latent variable model approach to word embeddings. arXiv:1502.03520, 2015.
[3] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.
[4] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In USENIX OSDI, 2014.
[5] C. De Sa, C. Zhang, K. Olukotun, and C. Ré. Taming the wild: A unified analysis of Hogwild!-style algorithms. arXiv:1506.06438, 2015.
[6] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.
[7] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.
In NIPS, pages 1646–1654, 2014.
[8] J. Duchi, M. I. Jordan, and B. McMahan. Estimation, optimization, and parallelism when data is sparse. In NIPS, pages 2832–2840, 2013.
[9] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 69–77. ACM, 2011.
[10] C. Jin, S. M. Kakade, C. Musco, P. Netrapalli, and A. Sidford. Robust shift-and-invert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. arXiv:1510.08896, 2015.
[11] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
[12] M. Krivelevich. The phase transition in site percolation on pseudo-random graphs. The Electronic Journal of Combinatorics, 23(1):1–12, 2016.
[13] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.
[14] Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein. GraphLab: A new framework for parallel machine learning. arXiv:1408.2041, 2014.
[15] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying suspicious URLs: An application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 681–688. ACM, 2009.
[16] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv:1507.06970, 2015.
[17] F. Niu, B. Recht, C. Re, and S. Wright. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693–701, 2011.
[18] X. Pan, J. E. Gonzalez, S. Jegelka, T. Broderick, and M. I. Jordan. Optimistic concurrency control for distributed unsupervised learning. In NIPS 26, 2013.
[19] X. Pan, S. Jegelka, J. E. Gonzalez, J. K. Bradley, and M. I. Jordan. Parallel double greedy submodular maximization. In NIPS 27, 2014.
[20] X. Pan, D. Papailiopoulos, S. Oymak, B. Recht, K. Ramchandran, and M. I. Jordan. Parallel correlation clustering on big graphs. In NIPS, pages 82–90, 2015.
[21] S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. arXiv:1506.06840, 2015.
[22] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
[23] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.
[24] C. Zhang and C. Ré. DimmWitted: A study of main-memory statistical analytics. Proceedings of the VLDB Endowment, 7(12):1283–1294, 2014.
[25] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A fast parallel SGD for matrix factorization in shared memory systems. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 249–256. ACM, 2013.
[26] M. Zinkevich, J. Langford, and A. J. Smola. Slow learners are fast.
In NIPS, pages 2331–2339, 2009.