{"title": "Parallelizing Support Vector Machines on Distributed Computers", "book": "Advances in Neural Information Processing Systems", "page_first": 257, "page_last": 264, "abstract": "Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let $n$ denote the number of training instances, $p$ the reduced matrix dimension after factorization ($p$ is significantly smaller than $n$), and $m$ the number of machines. PSVM reduces the memory requirement from $\\MO$($n^2$) to $\\MO$($np/m$), and improves computation time to $\\MO$($np^2/m$). Empirical studies on up to $500$ computers shows PSVM to be effective.", "full_text": "PSVM: Parallelizing Support Vector Machines\n\non Distributed Computers\n\nEdward Y. Chang\u2217, Kaihua Zhu, Hao Wang, Hongjie Bai,\n\nJian Li, Zhihuan Qiu, & Hang Cui\nGoogle Research, Beijing, China\n\nAbstract\n\nSupport Vector Machines (SVMs) suffer from a widely recognized scalability\nproblem in both memory use and computational time. To improve scalability,\nwe have developed a parallel SVM algorithm (PSVM), which reduces memory\nuse through performing a row-based, approximate matrix factorization, and which\nloads only essential data to each machine to perform parallel computation. Let n\ndenote the number of training instances, p the reduced matrix dimension after\nfactorization (p is signi\ufb01cantly smaller than n), and m the number of machines.\nPSVM reduces the memory requirement from O(n2) to O(np/m), and improves\ncomputation time to O(np2/m). 
Empirical study shows PSVM to be effective. PSVM is available as open source for download at http://code.google.com/p/psvm/.\n\n1 Introduction\n\nLet us examine the resource bottlenecks of SVMs in a binary classification setting to explain our proposed solution. Given a set of training data X = {(x_i, y_i) | x_i ∈ R^d}, i = 1, . . . , n, where x_i is an observation vector, y_i ∈ {−1, 1} is the class label of x_i, and n is the size of X, we apply SVMs on X to train a binary classifier. SVMs aim to find a hyperplane in the Reproducing Kernel Hilbert Space (RKHS) that maximizes the margin between the two classes of data in X with the smallest training error (Vapnik, 1995). This problem can be formulated as the following quadratic optimization problem:\n\nmin P(w, b, ξ) = (1/2)‖w‖^2 + C Σ_{i=1}^n ξ_i\ns.t. 1 − y_i(w^T φ(x_i) + b) ≤ ξ_i, ξ_i > 0,   (1)\n\nwhere w is a weighting vector, b is a threshold, C a regularization hyperparameter, and φ(·) a basis function which maps x_i to an RKHS space. The decision function of SVMs is f(x) = w^T φ(x) + b, where w and b are attained by solving P in (1). The optimization problem in (1) is the primal formulation of SVMs. It is hard to solve P directly, partly because the explicit mapping via φ(·) can make the problem intractable and partly because the mapping function φ(·) is often unknown. The method of Lagrange multipliers is thus introduced to transform the primal formulation into the dual one:\n\nmin D(α) = (1/2) α^T Q α − α^T 1\ns.t. 0 ≤ α ≤ C, y^T α = 0,   (2)\n\nwhere [Q]_ij = y_i y_j φ^T(x_i) φ(x_j), and α ∈ R^n is the Lagrange multiplier variable (or dual variable). 
The weighting vector w is related to α by w = Σ_{i=1}^n α_i φ(x_i).\n\n∗This work was initiated in 2005 when the author was a professor at UCSB.\n\nThe dual formulation D(α) requires an inner product of φ(x_i) and φ(x_j). SVMs utilize the kernel trick by specifying a kernel function to define the inner product K(x_i, x_j) = φ^T(x_i) φ(x_j). We can thus rewrite [Q]_ij as y_i y_j K(x_i, x_j). When the given kernel function K is psd (positive semi-definite), the dual problem D(α) is a convex Quadratic Programming (QP) problem with linear constraints, which can be solved via the Interior-Point Method (IPM) (Mehrotra, 1992). Both the computational and memory bottlenecks of SVM training lie in the IPM solver applied to the dual formulation of SVMs in (2).\nCurrently, the most effective IPM algorithm is the primal-dual IPM (Mehrotra, 1992). The principal idea of the primal-dual IPM is to remove the inequality constraints using a barrier function and then resort to the iterative Newton's method to solve the KKT linear system related to the Hessian matrix Q in D(α). The computational cost is O(n^3) and the memory usage O(n^2).\nIn this work, we propose a parallel SVM algorithm (PSVM) to reduce memory use and to parallelize both data loading and computation. Given n training instances, each with d dimensions, PSVM first loads the training data in a round-robin fashion onto m machines. The memory requirement per machine is O(nd/m). Next, PSVM performs a parallel row-based Incomplete Cholesky Factorization (ICF) on the loaded data. At the end of parallel ICF, each machine stores only a fraction of the factorized matrix, which takes up space of O(np/m), where p is the column dimension of the factorized matrix. (Typically, p can be set to about √n without noticeably degrading training accuracy.) PSVM reduces the memory use of IPM from O(n^2) to O(np/m), where p/m is much smaller than n. 
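To make the O(n^2) bottleneck concrete, the full matrix [Q]_ij = y_i y_j K(x_i, x_j) for a Gaussian kernel can be materialized in a few lines of NumPy. This is our own minimal single-machine sketch, not PSVM's implementation; the kernel width `sigma` is an assumed hyperparameter:

```python
import numpy as np

def gram_Q(X, y, sigma=1.0):
    """Materialize [Q]_ij = y_i y_j K(x_i, x_j) with a Gaussian kernel.

    Storage is O(n^2) -- exactly the cost PSVM's factorization avoids."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    K = np.exp(-d2 / (2.0 * sigma ** 2))             # Gaussian kernel matrix
    return (y[:, None] * y[None, :]) * K             # label signs fold into Q
```

Since K is positive semi-definite and Q = diag(y) K diag(y), Q is psd as well, which is what makes (2) a convex QP.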
PSVM then performs parallel IPM to solve the quadratic optimization problem in (2). The computation time is improved from about O(n^2) for a decomposition-based algorithm (e.g., SVMLight (Joachims, 1998), LIBSVM (Chang & Lin, 2001), SMO (Platt, 1998), and SimpleSVM (Vishwanathan et al., 2003)) to O(np^2/m). This work's main contributions are: (1) PSVM achieves memory reduction and computation speedup via a parallel ICF algorithm and parallel IPM. (2) PSVM handles kernels (in contrast to other algorithmic approaches (Joachims, 2006; Chu et al., 2006)). (3) We have implemented PSVM on our parallel computing infrastructures. PSVM effectively speeds up training time for large-scale tasks while maintaining high training accuracy.\nPSVM is a practical, parallel, approximate implementation that speeds up SVM training on today's distributed computing infrastructures for dealing with Web-scale problems. What we do not claim is as follows: (1) We make no claim that PSVM is the sole solution for speeding up SVMs. Algorithmic approaches such as (Lee & Mangasarian, 2001; Tsang et al., 2005; Joachims, 2006; Chu et al., 2006) can be more effective when memory is not a constraint or kernels are not used. (2) We do not claim that the algorithmic approach is the only avenue for speeding up SVM training. Data-processing approaches such as (Graf et al., 2005) can divide a serial algorithm (e.g., LIBSVM) into subtasks on subsets of training data to achieve good speedup. (Data-processing and algorithmic approaches complement each other, and can be used together to handle large-scale training.)\n\n2 PSVM Algorithm\n\nThe key step of PSVM is parallel ICF (PICF). Traditional column-based ICF (Fine & Scheinberg, 2001; Bach & Jordan, 2005) can reduce the computational cost, but its initial memory requirement is O(np), and hence it is not practical for very large data sets. 
PSVM devises parallel row-based ICF (PICF) as its initial step: it loads training instances onto parallel machines and performs factorization simultaneously on these machines. Once PICF has loaded the n training instances distributedly onto m machines and reduced the size of the kernel matrix through factorization, IPM can be solved on the parallel machines simultaneously. We present PICF first, and then describe how IPM takes advantage of PICF.\n\n2.1 Parallel ICF\n\nICF can approximate Q (Q ∈ R^{n×n}) by a smaller matrix H (H ∈ R^{n×p}, p ≪ n), i.e., Q ≈ HH^T. ICF, together with SMW (the Sherman-Morrison-Woodbury formula), can greatly reduce the computational complexity of solving an n × n linear system. The work of (Fine & Scheinberg, 2001) provides a theoretical analysis of how ICF influences the optimization problem in Eq. (2). The authors proved that the error of the optimal objective value introduced by ICF is bounded by C^2 l ε / 2, where C is the hyperparameter of SVM, l is the number of support vectors, and ε is the bound of\n\nAlgorithm 1 Row-based PICF\n\nInput: n training instances; p: rank of ICF matrix H; m: number of machines\nOutput: H distributed on m machines\nVariables:\nv: fraction of the diagonal vector of Q that resides in the local machine\nk: iteration number\nx_i: the ith training instance\nM: machine index set, M = {0, 1, . . . , m − 1}\nI_c: row-index set on machine c (c ∈ M), I_c = {c, c + m, c + 2m, . . .}\n1: for i = 0 to n − 1 do\n2:\n3: end for\n4: k ← 0; H ← 0; v ← the fraction of the diagonal vector of Q that resides in the local machine. 
Load x_i into machine (i mod m). (v(i), i ∈ I_c, can be obtained from x_i.)\n5: Initialize master to be machine 0.\n6: while k < p do\n7: Each machine c ∈ M selects its local pivot value, which is the largest element in v: lpv_{k,c} = max_{i ∈ I_c} v(i), and records the local pivot index, the row index corresponding to lpv_{k,c}: lpi_{k,c} = arg max_{i ∈ I_c} v(i).\n8: Gather the lpv_{k,c}'s and lpi_{k,c}'s (c ∈ M) at the master.\n9: The master selects the largest local pivot value as the global pivot value gpv_k and records i_k, the row index corresponding to the global pivot value: gpv_k = max_{c ∈ M} lpv_{k,c}.\n10: The master broadcasts gpv_k and i_k.\n11: Change master to machine (i_k mod m).\n12: Calculate H(i_k, k) according to (3) on the master.\n13: The master broadcasts the pivot instance x_{i_k} and the pivot row H(i_k, :). (Only the first k + 1 values of the pivot row need to be broadcast, since the remainder are zeros.)\n14: Each machine c ∈ M calculates its part of the kth column of H according to (4).\n15: Each machine c ∈ M updates v according to (5).\n16: k ← k + 1\n17: end while\n\nthe ICF approximation (i.e., tr(Q − HH^T) < ε). Experimental results in Section 3 show that when p is set to √n, the error can be negligible.\nOur row-based parallel ICF (PICF) works as follows. Let vector v be the diagonal of Q, and suppose the pivots (the largest diagonal values) are {i_1, i_2, . . . , i_k}; the kth iteration of ICF computes three equations:\n\nH(i_k, k) = √v(i_k)   (3)\nH(J_k, k) = (Q(J_k, i_k) − Σ_{j=1}^{k−1} H(J_k, j) H(i_k, j)) / H(i_k, k)   (4)\nv(J_k) = v(J_k) − H(J_k, k)^2,   (5)\n\nwhere J_k denotes the complement of {i_1, i_2, . . . , i_k}. 
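Equations (3)-(5) can be checked on a single machine before any distribution is involved. Below is a minimal serial sketch of pivoted incomplete Cholesky in NumPy (our own illustration, not the paper's code): with p = n and a full-rank psd matrix it recovers Q exactly, and truncating p gives the low-rank H that PSVM uses.

```python
import numpy as np

def icf(Q, p):
    """Serial pivoted incomplete Cholesky: returns H (n x p) with Q ~= H @ H.T."""
    n = Q.shape[0]
    H = np.zeros((n, p))
    v = np.diag(Q).astype(float).copy()   # running diagonal of the residual
    pivots = []
    for k in range(p):
        ik = int(np.argmax(v))            # pivot: largest residual diagonal
        pivots.append(ik)
        H[ik, k] = np.sqrt(v[ik])                                    # Eq. (3)
        J = np.array([i for i in range(n) if i not in pivots])       # J_k
        if J.size:
            H[J, k] = (Q[J, ik] - H[J, :k] @ H[ik, :k]) / H[ik, k]   # Eq. (4)
            v[J] -= H[J, k] ** 2                                     # Eq. (5)
        v[ik] = 0.0                       # pivot leaves the candidate set
    return H
```

Truncating after p steps is what caps the memory at O(np) serially, and at O(np/m) once the rows of H are spread over m machines.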
The algorithm iterates until the approximation of Q by H_k H_k^T (measured by trace(Q − H_k H_k^T)) is satisfactory, or until the predefined maximum number of iterations (that is, the desired rank p of the ICF matrix) is reached.\nAs suggested by G. Golub, a parallelized ICF algorithm can be obtained by constraining the parallelized Cholesky Factorization algorithm to iterate at most p times. However, in the proposed algorithm (Golub & Loan, 1996), matrix H is distributed by columns in a round-robin way on m machines (hence we call it column-based parallelized ICF). Such a column-based approach is optimal for the single-machine setting, but cannot gain the full benefit of parallelization, for two major reasons:\n\n1. Large memory requirement. All training data are needed by each machine to calculate Q(J_k, i_k). Therefore, each machine must be able to store a local copy of the training data.\n2. Limited parallelizable computation. Only the inner-product calculation Σ_{j=1}^{k−1} H(J_k, j) H(i_k, j) in (4) can be parallelized. Pivot selection, the summation of local inner-product results, the column calculation in (4), and the vector update in (5) must all be performed on one single machine.\n\nTo remedy these shortcomings of the column-based approach, we propose a row-based approach to parallelize ICF, which we summarize in Algorithm 1. Our row-based approach starts by initializing variables and loading training data onto m machines in a round-robin fashion (Steps 1 to 5). The algorithm then performs the ICF main loop until the termination criteria are satisfied (e.g., the rank of matrix H reaches p). 
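Inside this main loop, the distributed pivot selection (steps 7-10 of Algorithm 1) reduces to a local arg-max on each machine followed by a global max at the master. A toy single-process simulation of that exchange (our illustration; the real system does this with gather/broadcast messages):

```python
def select_global_pivot(local_v):
    """Steps 7-10 of Algorithm 1, simulated in one process.

    local_v: one dict per machine c, mapping row index i in I_c to v(i).
    Returns (gpv_k, i_k): the global pivot value and its row index."""
    local_pivots = []
    for vc in local_v:
        lpi = max(vc, key=vc.get)            # local pivot index on machine c
        local_pivots.append((vc[lpi], lpi))  # (lpv_{k,c}, lpi_{k,c})
    return max(local_pivots)                 # master picks the global pivot

# Round-robin row sets I_c = {c, c+m, c+2m, ...} for m = 2 machines:
v = [3.0, 9.0, 1.0, 5.0, 7.0, 2.0]
machines = [{i: v[i] for i in range(c, len(v), 2)} for c in range(2)]
gpv, ik = select_global_pivot(machines)      # picks v's global maximum
```

Only one (value, index) pair per machine crosses the network per iteration, which is why pivot selection contributes so little to the O(p^2 log m) communication bound.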
In the main loop, PICF performs five tasks in each iteration k:\n• Distributedly find a pivot, which is the largest value in the diagonal v of matrix Q (steps 7 to 10). Notice that PICF computes only the needed elements of Q from the training data, and it does not store Q.\n• Set the machine where the pivot resides as the master (step 11).\n• On the master, calculate H(i_k, k) according to (3) (step 12).\n• The master then broadcasts the pivot instance x_{i_k} and the pivot row H(i_k, :) (step 13).\n• Distributedly compute (4) and (5) (steps 14 and 15).\nAt the end of the algorithm, H is stored distributedly on m machines, ready for parallel IPM (presented in the next section). PICF enjoys three advantages: parallel memory use (O(np/m)), parallel computation (O(p^2 n/m)), and low communication overhead (O(p^2 log(m))). Regarding the communication overhead in particular, its fraction of the entire computation time shrinks as the problem size grows; we will verify this in the experimental section. This pattern permits a larger problem to be solved on more machines, taking advantage of parallel memory use and computation.\n\n2.2 Parallel IPM\n\nAs mentioned in Section 1, the most effective algorithm for solving a constrained QP problem is the primal-dual IPM. For a detailed description and the notation of IPM, please consult (Boyd, 2004; Mehrotra, 1992). 
For the purpose of SVM training, IPM boils down to iteratively solving the following equations in the Newton step:\n\n∆λ = −λ + vec(1/(t(C − α_i))) + diag(λ_i/(C − α_i)) ∆x   (6)\n∆ξ = −ξ + vec(1/(t α_i)) − diag(ξ_i/α_i) ∆x   (7)\n∆ν = (y^T Σ^{−1} z + y^T α) / (y^T Σ^{−1} y)   (8)\nD = diag(ξ_i/α_i + λ_i/(C − α_i))   (9)\n∆x = Σ^{−1}(z − y ∆ν),   (10)\n\nwhere Σ and z depend only on [α, λ, ξ, ν] from the last iteration as follows:\n\nΣ = Q + diag(ξ_i/α_i + λ_i/(C − α_i))   (11)\nz = −Qα + 1_n − νy + (1/t) vec(1/α_i − 1/(C − α_i)).   (12)\n\nThe computation bottleneck is the matrix inverse, which takes place on Σ when solving ∆ν in (8) and ∆x in (10). Equation (11) shows that Σ depends on Q, and we have shown that Q can be approximated through PICF by HH^T. Therefore, the bottleneck of the Newton step can be sped up from O(n^3) to O(p^2 n), and be parallelized to O(p^2 n/m).\n\nDistributed Data Loading\n\nTo minimize both storage and communication cost, PIPM stores data distributedly as follows:\n\n• Distribute matrix data. H is stored distributedly at the end of PICF.\n• Distribute n × 1 vector data. All n × 1 vectors are distributed in a round-robin fashion on m machines. These vectors are z, α, ξ, λ, ∆z, ∆α, ∆ξ, and ∆λ.\n• Replicate global scalar data. Every machine caches a copy of global data including ν, t, n, and ∆ν. 
Whenever a scalar is changed, a broadcast is required to maintain global consistency.\n\nParallel Computation of ∆ν\n\nRather than walking through all the equations, we describe how PIPM solves (8), where Σ^{−1} appears twice. An interesting observation is that parallelizing Σ^{−1}z (or Σ^{−1}y) is simpler than parallelizing Σ^{−1}. Let us explain how parallelizing Σ^{−1}z works; parallelizing Σ^{−1}y can follow suit.\nAccording to SMW (the Sherman-Morrison-Woodbury formula), we can write Σ^{−1}z as\n\nΣ^{−1}z = (D + Q)^{−1}z ≈ (D + HH^T)^{−1}z\n= D^{−1}z − D^{−1}H(I + H^T D^{−1}H)^{−1} H^T D^{−1}z\n= D^{−1}z − D^{−1}H(GG^T)^{−1} H^T D^{−1}z.\n\nΣ^{−1}z can be computed in four steps:\n1. Compute D^{−1}z. D can be derived from locally stored vectors, following (9). D^{−1}z is an n × 1 vector, and can be computed locally on each of the m machines.\n2. Compute t1 = H^T D^{−1}z. Every machine stores some rows of H and the corresponding part of D^{−1}z. This step can be computed locally on each machine. The results are sent to the master (which can be a randomly picked machine for all PIPM iterations) to be aggregated into t1 for the next step.\n3. Compute t2 = (GG^T)^{−1} t1. This step is completed on the master, since it has all the required data. G can be obtained from H in a straightforward manner, as shown in SMW. Computing t2 = (GG^T)^{−1} t1 is equivalent to solving the linear system t1 = (GG^T) t2: PIPM first solves t1 = G y0 for y0, and then solves y0 = G^T t2 for t2. The master then broadcasts t2 to all machines.\n4. Compute D^{−1}H t2. All machines have a copy of t2, and can compute D^{−1}H t2 locally to solve for Σ^{−1}z.\nSimilarly, Σ^{−1}y can be computed at the same time. 
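Collapsed onto one machine, the four steps amount to the following NumPy sketch (our own illustration; `d` holds the diagonal of D, and only p × p triangular systems are ever factorized, never an n × n one):

```python
import numpy as np

def smw_solve(d, H, z):
    """Solve (D + H H^T) x = z with D = diag(d) via Sherman-Morrison-Woodbury.

    Mirrors the four steps in the text, on a single machine."""
    Dinv_z = z / d                                       # step 1: D^-1 z
    t1 = H.T @ Dinv_z                                    # step 2: H^T D^-1 z
    GGt = np.eye(H.shape[1]) + H.T @ (H / d[:, None])    # I + H^T D^-1 H = G G^T
    G = np.linalg.cholesky(GGt)
    y0 = np.linalg.solve(G, t1)                          # step 3: t1 = G y0
    t2 = np.linalg.solve(G.T, y0)                        #         y0 = G^T t2
    return Dinv_z - (H @ t2) / d                         # step 4: D^-1 z - D^-1 H t2
```

In the distributed version, steps 1, 2, and 4 act on locally stored rows of H and slices of the vectors, and only the small p-vectors t1 and t2 travel to and from the master.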
Once we have obtained both, we can solve ∆ν according to (8).\n\n2.3 Computing b and Writing Back\n\nWhen the IPM iteration stops, we have the value of α and hence the classification function\n\nf(x) = Σ_{i=1}^{N_s} α_i y_i k(s_i, x) + b.\n\nHere N_s is the number of support vectors and the s_i are the support vectors. To complete this classification function, b must be computed. According to the SVM model, given a support vector s, we obtain one of two results for f(s): f(s) = +1 if y_s = +1, or f(s) = −1 if y_s = −1. In practice, we can select M (say 1,000) support vectors and compute the average of the resulting b's in parallel using MapReduce (Dean & Ghemawat, 2004).\n\n3 Experiments\n\nWe conducted experiments on PSVM to evaluate its 1) class-prediction accuracy, 2) scalability on large datasets, and 3) overheads. The experiments were conducted on up to 500 machines in our data center. Not all machines are identically configured; however, each machine is configured with a CPU faster than 2 GHz and memory larger than 4 GBytes.\n\nTable 1: Class-prediction Accuracy with Different p Settings.\n\ndataset     samples (train/test)   LIBSVM   p = n^0.1   p = n^0.2   p = n^0.3   p = n^0.4   p = n^0.5\nsvmguide1   3,089/4,000            0.9608   0.6563      0.9000      0.9170      0.9495      0.9593\nmushrooms   7,500/624              1        0.9904      0.9920      1           1           1\nnews20      18,000/1,996           0.7835   0.6949      0.6949      0.6969      0.7806      0.7811\nImage       199,957/84,507         0.8490   0.7293      0.7210      0.8041      0.8121      0.8258\nCoverType   522,910/58,102         0.9769   0.9764      0.9762      0.9766      0.9761      0.9766\nRCV         781,265/23,149         0.9575   0.8527      0.8586      0.8616      0.9065      0.9264\n\n3.1 Class-prediction Accuracy\n\nPSVM employs PICF to approximate an n × n kernel matrix Q with an n × p matrix H. This experiment evaluated how the choice of p affects class-prediction accuracy. 
We set p of PSVM to n^t, where t ranges from 0.1 to 0.5 in increments of 0.1, and compared its class-prediction accuracy with that achieved by LIBSVM. The first two columns of Table 1 enumerate the datasets and their sizes with which we experimented. We used a Gaussian kernel and selected the best C and σ separately for LIBSVM and PSVM. For CoverType and RCV, we loosened the termination condition (set -e 1, default 0.001) and turned off the shrinking heuristics (set -h 0) to make LIBSVM terminate within several days. The table shows that when t is set to 0.5 (or p = √n), the class-prediction accuracy of PSVM approaches that of LIBSVM.\nWe compared only with LIBSVM because it is arguably the best open-source SVM implementation in both accuracy and speed. Another possible candidate is CVM (Tsang et al., 2005). Our experimental result on the CoverType dataset outperforms the result reported by CVM on the same dataset in both accuracy and speed. Moreover, CVM's training time has been shown to be unpredictable by (Loosli & Canu, 2006), since the training time is sensitive to the selection of stopping criteria and hyperparameters. For how we position PSVM with respect to other related work, please refer to our disclaimer at the end of Section 1.\n\n3.2 Scalability\n\nFor scalability experiments, we used three large datasets. Table 2 reports the speedup of PSVM on up to m = 500 machines. When a dataset is large, a single machine cannot store the factorized matrix H in its local memory, so we cannot obtain the running time of PSVM on one machine. We thus used 10 machines as the baseline and measured the speedup of using more than 10 machines. To quantify speedup, we made the assumption that the speedup of using 10 machines is 10, compared to using one machine. 
This assumption is reasonable for our experiments, since PSVM does enjoy linear speedup when the number of machines is up to 30.\n\nTable 2: Speedup (p is set to √n); the LIBSVM training time is reported in the last row for reference.\n\nMachines   Image (200k)          CoverType (500k)        RCV (800k)\n           Time (s)    Speedup   Time (s)      Speedup   Time (s)       Speedup\n10         1,958 (9)   10*       16,818 (442)  10*       45,135 (1373)  10*\n30         572 (8)     34.2      5,591 (10)    30.1      12,289 (98)    36.7\n50         473 (14)    41.4      3,598 (60)    46.8      7,695 (92)     58.7\n100        330 (47)    59.4      2,082 (29)    80.8      4,992 (34)     90.4\n150        274 (40)    71.4      1,865 (93)    90.2      3,313 (59)     136.3\n200        294 (41)    66.7      1,416 (24)    118.7     3,163 (69)     142.7\n250        397 (78)    49.4      1,405 (115)   119.7     2,719 (203)    166.0\n500        814 (123)   24.1      1,655 (34)    101.6     2,671 (193)    169.0\nLIBSVM     4,334       NA        28,149        NA        184,199        NA\n\nWe trained PSVM three times for each dataset-m combination. The time reported in the table is the average of three runs, with the standard deviation provided in parentheses. The observed variance was caused by the variance of machine loads, as all machines were shared with other tasks running in our data centers. We can observe in Table 2 that the larger the dataset, the better the speedup. Figures 1(a), (b) and (c) plot the speedup of Image, CoverType, and RCV, respectively. All datasets enjoy a linear speedup when the number of machines is moderate. For instance, PSVM achieves linear speedup on RCV when running on up to around 100 machines. PSVM scales well until around 250 machines. After that, adding more machines yields diminishing returns. 
This result led to our examination of the overheads of PSVM, presented next.\n\nFigure 1: Speedup and Overheads of Three Datasets. (Panels (a)-(c): speedup for Image (200k), CoverType (500k), and RCV (800k); panels (d)-(f): the corresponding overheads; panels (g)-(i): the corresponding fractions of running time.)\n\n3.3 Overheads\n\nPSVM cannot achieve linear speedup when the number of machines continues to increase beyond a data-size-dependent threshold. This is expected, due to communication and synchronization overheads. Communication time is incurred when message passing takes place between machines. Synchronization overhead is incurred when the master machine waits for task completion on the slowest machine. (The master could wait forever if a child machine fails. We have implemented a checkpoint scheme to deal with this issue.)\nThe running time consists of three parts: computation (Comp), communication (Comm), and synchronization (Sync). Figures 1(d), (e) and (f) show how the Comm and Sync overheads influence the speedup curves. In the figures, we draw at the top the computation-only line (Comp), which approaches the linear speedup line. Computation speedup can become sublinear when adding machines beyond a threshold, because of the computation bottleneck of the unparallelizable step 12 in Algorithm 1 (whose computation time is O(p^2)). When m is small, this bottleneck is insignificant in the total computation time. According to Amdahl's law, however, even a small fraction of unparallelizable computation can cap speedup. Fortunately, the larger the dataset, the smaller this unparallelizable fraction, which is O(m/n). 
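Amdahl's cap is easy to see numerically with a one-line model (our illustration, with an assumed fixed serial fraction; in PSVM the serial fraction itself shrinks as O(m/n)):

```python
def amdahl_speedup(m, serial_fraction):
    """Amdahl's law: speedup on m machines when a fixed fraction of the
    work is unparallelizable (serial)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / m)

# With just 1% serial work, 500 machines stay well under the 100x asymptote:
s500 = amdahl_speedup(500, 0.01)   # about 83.5x, capped at 1/0.01 = 100x
```

Halving the serial fraction (e.g., by doubling the dataset at fixed m) raises the attainable speedup, which matches the observation that the larger datasets scale to more machines.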
Therefore, more machines (larger m) can be\nemployed for larger datasets (larger n) to gain speedup.\n\n7\n\n\fWhen communication overhead or synchronization overhead is accounted for (the Comp + Comm\nline and the Comp + Comm + Sync line), the speedup deteriorates. Between the two overheads, the\nsynchronization overhead does not impact speedup as much as the communication overhead does.\nFigures 1(g), (h), and (i) present the percentage of Comp, Comm, and Sync in total running time.\nThe synchronization overhead maintains about the same percentage when m increases, whereas the\npercentage of communication overhead grows with m. As mentioned in Section 2.1, the communi-\ncation overhead is O(p2 log(m)), growing sub-linearly with m. But since the computation time per\nnode decreases as m increases, the fraction of the communication overhead grows with m. There-\nfore, PSVM must select a proper m for a training task to maximize the bene\ufb01t of parallelization.\n\n4 Conclusion\n\nIn this paper, we have shown how SVMs can be parallelized to achieve scalable performance. PSVM\ndistributedly loads training data on parallel machines, reducing memory requirement through ap-\nproximate factorization on the kernel matrix. PSVM solves IPM in parallel by cleverly arranging\ncomputation order. We have made PSVM open source at http://code.google.com/p/psvm/.\nAcknowledgement\n\nThe \ufb01rst author is partially supported by NSF under Grant Number IIS-0535085.\nReferences\nBach, F. R., & Jordan, M. I. (2005). Predictive low-rank decomposition for kernel methods. Pro-\n\nceedings of the 22nd International Conference on Machine Learning.\n\nBoyd, S. (2004). Convex optimization. Cambridge University Press.\nChang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software avail-\n\nable at http://www.csie.ntu.edu.tw/ cjlin/libsvm.\n\nChu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., & Olukotun, K. (2006). 
Map\n\nreduce for machine learning on multicore. NIPS.\n\nDean, J., & Ghemawat, S. (2004). Mapreduce: Simpli\ufb01ed data processing on large clusters.\n\nOSDI\u201904: Symposium on Operating System Design and Implementation.\n\nFine, S., & Scheinberg, K. (2001). Ef\ufb01cient svm training using low-rank kernel representations.\n\nJournal of Machine Learning Research, 2, 243\u2013264.\n\nGhemawat, S., Gobioff, H., & Leung, S.-T. (2003). The google \ufb01le system. 19th ACM Symposium\n\non Operating Systems Principles.\n\nGolub, G. H., & Loan, C. F. V. (1996). Matrix computations. Johns Hopkins University Press.\nGraf, H. P., Cosatto, E., Bottou, L., Dourdanovic, I., & Vapnik, V. (2005). Parallel support vector\nmachines: The cascade svm. In Advances in neural information processing systems 17, 521\u2013528.\nJoachims, T. (1998). Making large-scale svm learning practical. Advances in Kernel Methods -\n\nSupport Vector Learning.\n\nJoachims, T. (2006). Training linear svms in linear time. ACM KDD, 217\u2013226.\nLee, Y.-J., & Mangasarian, O. L. (2001). Rsvm: Reduced support vector machines. First SIAM\n\nInternational Conference on Data Mining. Chicago.\n\nLoosli, G., & Canu, S. (2006). Comments on the core vector machines: Fast svm training on very\n\nlarge data sets (Technical Report).\n\nMehrotra, S. (1992). On the implementation of a primal-dual interior point method. SIAM J. Opti-\n\nmization, 2.\n\nPlatt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector\n\nmachines (Technical Report MSR-TR-98-14). Microsoft Research.\n\nTsang, I. W., Kwok, J. T., & Cheung, P.-M. (2005). Core vector machines: Fast svm training on\n\nvery large data sets. Journal of Machine Learning Research, 6, 363\u2013392.\n\nVapnik, V. (1995). The nature of statistical learning theory. New York: Springer.\nVishwanathan, S., Smola, A. J., & Murty, M. N. (2003). Simplesvm. 
ICML.\n", "award": [], "sourceid": 435, "authors": [{"given_name": "Kaihua", "family_name": "Zhu", "institution": null}, {"given_name": "Hao", "family_name": "Wang", "institution": null}, {"given_name": "Hongjie", "family_name": "Bai", "institution": null}, {"given_name": "Jian", "family_name": "Li", "institution": null}, {"given_name": "Zhihuan", "family_name": "Qiu", "institution": null}, {"given_name": "Hang", "family_name": "Cui", "institution": null}, {"given_name": "Edward", "family_name": "Chang", "institution": null}]}