{"title": "Large-scale L-BFGS using MapReduce", "book": "Advances in Neural Information Processing Systems", "page_first": 1332, "page_last": 1340, "abstract": "L-BFGS has been applied as an effective parameter estimation method for various machine learning algorithms since 1980s. With an increasing demand to deal with massive instances and variables, it is important to scale up and parallelize L-BFGS effectively in a distributed system. In this paper, we study the problem of parallelizing the L-BFGS algorithm in large clusters of tens of thousands of shared-nothing commodity machines. First, we show that a naive implementation of L-BFGS using Map-Reduce requires either a significant amount of memory or a large number of map-reduce steps with negative performance impact. Second, we propose a new L-BFGS algorithm, called Vector-free L-BFGS, which avoids the expensive dot product operations in the two loop recursion and greatly improves computation efficiency with a great degree of parallelism. The algorithm scales very well and enables a variety of machine learning algorithms to handle a massive number of variables over large datasets. We prove the mathematical equivalence of the new Vector-free L-BFGS and demonstrate its excellent performance and scalability using real-world machine learning problems with billions of variables in production clusters.", "full_text": "Large-scale L-BFGS using MapReduce\n\nWeizhu Chen, Zhenghao Wang, Jingren Zhou\n\n{wzchen,zhwang,jrzhou}@microsoft.com\n\nMicrosoft\n\nAbstract\n\nL-BFGS has been applied as an effective parameter estimation method for various\nmachine learning algorithms since 1980s. With an increasing demand to deal\nwith massive instances and variables, it is important to scale up and parallelize\nL-BFGS effectively in a distributed system. In this paper, we study the problem\nof parallelizing the L-BFGS algorithm in large clusters of tens of thousands of\nshared-nothing commodity machines. 
First, we show that a naive implementation of L-BFGS using Map-Reduce requires either a significant amount of memory or a large number of map-reduce steps with negative performance impact. Second, we propose a new L-BFGS algorithm, called Vector-free L-BFGS, which avoids the expensive dot product operations in the two-loop recursion and greatly improves computation efficiency with a great degree of parallelism. The algorithm scales very well and enables a variety of machine learning algorithms to handle a massive number of variables over large datasets. We prove the mathematical equivalence of the new Vector-free L-BFGS and demonstrate its excellent performance and scalability using real-world machine learning problems with billions of variables in production clusters.\n\n1 Introduction\n\nIn the big data era, many applications require solving optimization problems with billions of variables on a huge amount of training data. Problems of this scale are increasingly common, such as Ads CTR prediction [1] and deep neural networks [2]. Another trend is the wide adoption of map-reduce [3] environments built with commodity hardware. Those large-scale optimization problems are often expected to be solved in a map-reduce environment where the big data are stored.\nWhen a problem has a huge number of variables, it can be solved efficiently only if the storage and computation costs are managed effectively. Among a diverse collection of large-scale optimization methods, Limited-memory BFGS (L-BFGS) [4] is one of the most frequently used optimization methods in practice [5]. In this paper, we study the L-BFGS implementation for billion-variable-scale problems in a map-reduce environment. The original L-BFGS algorithm and its update procedure were proposed in the 1980s. Many popular optimization software packages implement it as a fundamental building block. 
Approaches to applying it to problems with up to millions of variables are well studied and implemented in various optimization packages [6]. However, studies on how to scale L-BFGS to billions of variables are still in their very early stages. At such a massive scale, the parameters, their gradients, and the associated L-BFGS historical states are not only too large to be stored in the memory of a single computation node, but also impose too high a computational cost for a single processor or multiple cores to handle within a reasonable time. Therefore, it is critical to explore an effective decomposition over both examples and models via distributed learning. Yet, to our knowledge, there is still very limited work exploring billion-variable-scale L-BFGS. This directly leads to the consequence that very little work can scale machine learning algorithms up to the billion-variable scale using L-BFGS on map-reduce.\n\nIn this paper, we start by carefully studying the implementation of L-BFGS in a map-reduce environment. We examine two typical L-BFGS implementations in map-reduce and present their scaling obstacles. In particular, given a problem with d variables and m historical states to approximate the Hessian [5], traditional implementations [6][5] either need to store 2md variables in memory or need to perform 2m map-reduce steps per iteration. This clearly creates huge overhead for a problem with billions of variables and prevents a scalable implementation in map-reduce.\nTo overcome these limitations, we re-examine the original L-BFGS algorithm and propose a new L-BFGS update procedure, called Vector-free L-BFGS (VL-BFGS), which is specifically devised for distributed learning with a huge number of variables. In particular, we replace the original L-BFGS update procedure based on vector operations, known as the two-loop recursion, with a new procedure relying only on scalar operations. 
The new two-loop recursion in VL-BFGS is mathematically equivalent to the original algorithm but independent of the number of variables. Meanwhile, it reduces the memory requirement from O(md) to O(m2), where d could be billion-scale but m is often less than 10. Moreover, it requires only 3 map-reduce steps per iteration, compared to 2m map-reduce steps in a naive distributed implementation.\nThis new algorithm enables a collection of machine learning algorithms to scale to billions of variables in a map-reduce environment. We demonstrate its scalability and advantage over other approaches designed for large-scale problems with billions of variables, and share our experience after deploying it into an industrial cluster with tens of thousands of machines.\n\n2 Related Work\n\nL-BFGS [4][7] is a quasi-Newton method based on the BFGS [8][9] update procedure, maintaining a compact approximation of the Hessian with modest storage requirements. Traditional implementations of L-BFGS follow [6] or [5] using the compact two-loop recursion update procedure. Although it has been applied in industry to solve various optimization problems for decades, recent work, such as [10][11], continues to demonstrate its reliability and effectiveness over other optimization methods. In contrast to our work, these implementations run L-BFGS on a single machine, while we focus on the L-BFGS implementation in a distributed environment.\nIn the context of distributed learning, there have recently been extensive research breakthroughs. GraphLab [12] built a parallel distributed framework for graph computation. [13] introduced a framework to parallelize various machine learning algorithms in a multi-core environment. [14] applied the ADMM technique to distributed learning. [15] proposed a delayed version of distributed online learning. 
General distributed learning techniques closest to our work are approaches based on parallel gradient calculation followed by a centralized algorithm ([7][16][17]). Different from our work, these are built on fully connected environments such as MPI, while we focus on the map-reduce environment with loose connections. Their centralized algorithm is often the bottleneck of the whole procedure and limits the scalability of the algorithm. For example, [17] clearly stated that it is impractical for their L-BFGS algorithm to run on their large dataset due to huge memory consumption in the centralized algorithm, although L-BFGS has been shown to be an excellent candidate for their problem. The work closest to ours applies L-BFGS in map-reduce-like environments, such as [18][2], which solve large-scale problems in adapted map-reduce environments using L-BFGS. [18] ran L-BFGS on a map-reduce plus AllReduce environment to demonstrate the power of large-scale learning with map-reduce. Although it has been shown to scale up to billions of data instances with trillions of entries in the data matrix, the number of variables in their problem is only about 16 million due to the constraints in the centralized computation of the L-BFGS direction. [2] used L-BFGS to solve deep learning problems. It introduced parameter servers to split a global model into multiple partitions and store each partition separately. Despite their successes, from the algorithmic point of view, their two-loop recursion update procedure is still highly dependent on the number of variables. Compared with these works, our proposed two-loop recursion update procedure is independent of the number of variables and offers much better parallelism. Furthermore, the proposed algorithm can run on a pure map-reduce environment, while previous work [2][18] requires special components such as AllReduce or parameter servers. 
In addition, it is straightforward for previous work, such as [2][18][17], to leverage our proposal to scale their problems up by another order of magnitude in terms of the number of variables.\n\n3 L-BFGS Algorithm\n\nGiven an optimization problem with d variables, BFGS requires storing a dense d by d matrix to approximate the inverse Hessian, whereas L-BFGS only needs to store a few vectors of length d to approximate the Hessian implicitly. Let us denote f as the objective function, g as the gradient and \u00b7 as the dot product between two vectors. L-BFGS maintains the historical states of the previous m (generally m = 10) updates of the current position x and its gradient g = \u2207f (x).\nIn the L-BFGS algorithm, the historical states are represented as the last m updates of the form sk = xk+1 \u2212 xk and yk = gk+1 \u2212 gk, where sk represents the position difference and yk represents the gradient difference in iteration k. Each of them is a vector of length d. All of these 2m vectors, together with the current gradient gk, are used to calculate a new direction in Line 3 of Algorithm 1.\n\nAlgorithm 1: L-BFGS Algorithm Outline\nInput: starting point x0, integer history size m > 0, k = 1;\nOutput: the position x with a minimal objective function\n1 while not converged do\n2 Calculate gradient \u2207f (xk) at position xk ;\n3 Compute direction pk using Algorithm 2 ;\n4 Compute xk+1 = xk + \u03b1kpk where \u03b1k is chosen to satisfy the Wolfe conditions ;\n5 if k > m then\n6 Discard vector pair sk\u2212m, yk\u2212m from memory storage ;\n7 end\n8 Update sk = xk+1 \u2212 xk, yk = \u2207f (xk+1) \u2212 \u2207f (xk), k = k + 1 ;\n9 end\n\nAlgorithm 2: L-BFGS two-loop recursion\nInput: \u2207f (xk); si, yi where i = k \u2212 m, ..., k \u2212 1\nOutput: new direction p\n1 p = \u2212\u2207f (xk) ;\n2 for i \u2190 k \u2212 1 to k \u2212 m do\n3 \u03b1i \u2190 (si\u00b7p)/(si\u00b7yi) ;\n4 p = p \u2212 \u03b1i \u00b7 yi ;\n5 end\n6 p = ((sk\u22121\u00b7yk\u22121)/(yk\u22121\u00b7yk\u22121)) p ;\n7 for i \u2190 k \u2212 m to k \u2212 1 do\n8 \u03b2 = (yi\u00b7p)/(si\u00b7yi) ;\n9 p = p + (\u03b1i \u2212 \u03b2) \u00b7 si ;\n10 end\n\nThe core update procedure in Algorithm 1 is Line 3, which calculates a new direction pk using s and y with the current gradient \u2207f (xk). The most common approach for this calculation is the two-loop recursion in Algorithm 2 [5][6]. It initializes the direction p with the gradient and continues to update it using the historical states y and s. More information about the two-loop recursion can be found in [5].\n\n4 A Map-Reduce Implementation\n\nThe main procedure in Algorithm 1 lies in Lines 2, 3 and 4. The calculation of the gradient in Line 2 can be straightforwardly parallelized by dividing the data into multiple partitions. In the map-reduce environment, we can use one map step to calculate the partial gradient for partial data and one reduce step to aggregate them into a global gradient vector. The verification of the Wolfe conditions only depends on the calculation of the objective function following the line search procedure [5], so Line 4 can also be easily parallelized following the same approach as Line 2. Therefore, the challenge in the L-BFGS algorithm is Line 3. In other words, the difficulty comes from the calculation of the two-loop recursion, as shown in Algorithm 2.\n\n4.1 Centralized Update\n\nThe simplest implementation of Algorithm 2 is to run it on a single processor. We can easily perform this in a singleton reduce. However, the challenge is that Algorithm 2 requires 2m + 1 vectors, each of length d. This can be feasible when d is in the million scale. Nevertheless, when d is in the billion scale, either the storage or the computation cost becomes a significant challenge and makes it impractical to implement in map-reduce. 
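To make the cost concrete, the two-loop recursion of Algorithm 2 can be sketched in plain Python (an illustrative sketch under our own naming, not the production implementation; vectors are plain lists):

```python
def dot(u, v):
    # Dot product between two length-d vectors.
    return sum(a * b for a, b in zip(u, v))

def two_loop_recursion(grad, s_list, y_list):
    """Classic L-BFGS two-loop recursion (Algorithm 2), illustrative sketch.

    grad: gradient at x_k; s_list/y_list: the last m position/gradient
    difference vectors, oldest first. Every loop iteration touches full
    length-d vectors, which is what becomes prohibitive for billion-scale d.
    """
    m = len(s_list)
    p = [-g for g in grad]                       # p = -grad f(x_k)
    alphas = [0.0] * m
    for i in reversed(range(m)):                 # first loop: newest to oldest
        alphas[i] = dot(s_list[i], p) / dot(s_list[i], y_list[i])
        p = [pj - alphas[i] * yj for pj, yj in zip(p, y_list[i])]
    # Initial Hessian scaling by (s_{k-1}.y_{k-1}) / (y_{k-1}.y_{k-1}).
    scale = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    p = [scale * pj for pj in p]
    for i in range(m):                           # second loop: oldest to newest
        beta = dot(y_list[i], p) / dot(s_list[i], y_list[i])
        p = [pj + (alphas[i] - beta) * sj for pj, sj in zip(p, s_list[i])]
    return p
```

Each of the 2m loop iterations performs a dot product and an update over a full length-d vector, which is exactly the storage and computation cost this section describes.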
Take the Ads CTR prediction task [1] as an example: it has more than 1 billion features. If we set m = 10 in a linear model, this produces 21 vectors of 1 billion entries each, i.e., 21 billion values. Even if we compactly use a single-precision floating point number for each value, it requires 84 GB of memory to store the historical states and the gradient. For a map-reduce cluster built from commodity hardware and shared with other applications, this is generally infeasible nowadays. For example, for the cluster into which we deployed L-BFGS, the maximal memory limit for a map-reduce step is 6 GB.\n\n4.2 Distributed Update\n\nDue to the storage limitation of the centralized update, an alternative is to store s and y in multiple partitions without overlap and use a map-reduce step to calculate every dot product, such as si\u00b7p and si\u00b7yi in Line 3 of Algorithm 2. Yet, if each dot product within the for-loops of Algorithm 2 requires a map-reduce step, this results in at least 2m map-reduce steps per two-loop recursion. If we call Algorithm 2 N times (iterations) in Algorithm 1, this leads to 2mN map-reduce steps. For example, if m = 10 and N = 100, this produces 2000 map-reduce steps in a map-reduce job. Unfortunately, each map-reduce step brings significant overhead due to scheduling and application launching costs. For a job with thousands of map-reduce steps, these costs often dominate the overall running time and make the useful computational time spent in algorithmic vector operations negligible. Moreover, taking our current production cluster as an example, a job with such a huge number of map-reduce steps is too large for execution. 
It would trigger a compilation timeout error, or simply be too complicated for the execution engine to execute.\n\n5 Vector-free L-BFGS\n\nFor the reasons mentioned above, a feasible two-loop recursion procedure has to limit both the memory consumption and the number of map-reduce steps per iteration. To strictly limit the memory consumption in Algorithm 2, we cannot store the 2m + 1 vectors of length d in memory unless d is only up to the million scale. To comply with the allowable number of map-reduce steps per iteration, it is not practical either to perform map-reduce steps within the for-loops of Algorithm 2. Both of these constraints motivate us to carefully re-examine Algorithm 2 and lead to the proposed algorithm in this section.\n\n5.1 Basic Idea\n\nBefore illustrating the new procedure, let us describe the following three observations about Algorithm 2 that guide the design of the new procedure in Algorithm 3:\n\n1. All inputs are invariant during Algorithm 2.\n2. All operations applied on p are linear with respect to the inputs. In other words, p can be formalized as a linear combination of the inputs, although its coefficients are unknown.\n3. The core numeric operation is the dot product between two vectors.\n\nObservations 1 and 2 motivate us to formalize the inputs as (2m + 1) invariant base vectors:\n\nb1 = sk\u2212m, b2 = sk\u2212m+1, ..., bm = sk\u22121 (1)\nbm+1 = yk\u2212m, bm+2 = yk\u2212m+1, ..., b2m = yk\u22121 (2)\nb2m+1 = \u2207f (xk) (3)\n\nThus we can represent p as a linear combination of the bi. Denoting by \u03b4k the scalar coefficients in this linear combination, we can write p as:\n\np = \u2211k=1..2m+1 \u03b4kbk (4)\n\nSince the bk are inputs and invariant during the two-loop recursion, if we can calculate the coefficients \u03b4k, we can proceed to calculate the direction p.\nFollowing observation 3 and re-examining Algorithm 2, we classify the dot product operations into two categories in terms of whether p is involved in the calculation. For the first category, involving only dot products between the inputs (si, yi), a straightforward intuition is to pre-compute their dot products, so as to replace each dot product with a scalar in the two-loop recursion. However, the second category of dot products, those involving p, cannot follow the same procedure: because the direction p is ever-changing during the for-loops, dot products involving p cannot be pre-computed. Fortunately, thanks to the linear decomposition of p in observation 2 and Eqn. 4, we can decompose any dot product involving p into a summation of dot products between the base vectors, weighted by the corresponding coefficients. This elegant mathematical step only becomes possible after we formalize p as a linear combination of the base vectors.\n\n5.2 The VL-BFGS Algorithm\n\nWe present the algorithmic procedure in Algorithm 3. Let us denote the results of the dot products between every two base vectors as a matrix of (2m + 1) \u2217 (2m + 1) scalars. The proposed VL-BFGS algorithm takes only this matrix as input. Similar to the original L-BFGS algorithm, it has a two-loop recursion, but all operations depend only on scalars. In Lines 1-2, it assigns the initial values for \u03b4i. 
This is equivalent to Line 1 in Algorithm 2, which uses the opposite direction of the gradient as the initial direction. The original calculation of \u03b1i in Line 6 relies on the direction vector p. It is worth noting that p varies within the first loop, in which \u03b4 is updated; thus we cannot pre-compute any dot product involving p. However, as mentioned earlier and according to observation 2 and Eqn. 4, we can formalize bj \u00b7 p as a summation of dot products between base vectors weighted by the corresponding coefficients, as shown in Line 6 of Algorithm 3. Meanwhile, since all base vectors are invariant, their dot products can be pre-computed and replaced with scalars, which then multiply the ever-changing \u03b4l. These are only scalar operations, and they are extremely efficient. Line 7 continues to update the scalar coefficient \u03b4m+j, which is equivalent to updating the direction p with respect to the base vector bm+j, i.e., the corresponding yj. The same reasoning applies to Lines 14 and 15.\n\nAlgorithm 3: Vector-free L-BFGS two-loop recursion\nInput: (2m + 1) \u2217 (2m + 1) dot product matrix between the bi\nOutput: the coefficients \u03b4i where i = 1, 2, ..., 2m + 1\n1 for i \u2190 1 to 2m + 1 do\n2 \u03b4i = i \u2264 2m ? 0 : \u22121 ;\n3 end\n4 for i \u2190 k \u2212 1 to k \u2212 m do\n5 j = i \u2212 (k \u2212 m) + 1 ;\n6 \u03b1i \u2190 (si\u00b7p)/(si\u00b7yi) = (bj\u00b7p)/(bj\u00b7bm+j) = (\u2211l=1..2m+1 \u03b4l bl\u00b7bj)/(bj\u00b7bm+j) ;\n7 \u03b4m+j = \u03b4m+j \u2212 \u03b1i ;\n8 end\n9 for i \u2190 1 to 2m + 1 do\n10 \u03b4i = ((bm\u00b7b2m)/(b2m\u00b7b2m)) \u03b4i ;\n11 end\n12 for i \u2190 k \u2212 m to k \u2212 1 do\n13 j = i \u2212 (k \u2212 m) + 1 ;\n14 \u03b2 = (yi\u00b7p)/(si\u00b7yi) = (bm+j\u00b7p)/(bj\u00b7bm+j) = (\u2211l=1..2m+1 \u03b4l bm+j\u00b7bl)/(bj\u00b7bm+j) ;\n15 \u03b4j = \u03b4j + (\u03b1i \u2212 \u03b2) ;\n16 end\n\nWith the new formalization of p in Eqn. 4 and the invariance of yi and si during Algorithm 2, Line 4 in Algorithm 2, which updates p with yi (equivalent to bm+j), is mathematically equivalent to Line 7 in Algorithm 3, as is Line 9 in Algorithm 2 to Line 15 in Algorithm 3. For the other lines of the two algorithms, it is easy to infer their equivalence with the consideration of Eqns. 1-4. Thus, Algorithm 3 is mathematically equivalent to Algorithm 2.\n\n5.3 Complexity Analysis and Comparison\n\nUsing the dot product matrix of scalars as input, the calculation in Algorithm 3 is extremely efficient, since it is based entirely on scalars. Altogether, it only requires 8m2 scalar multiplications in the two for-loops. This is tiny compared to any vector operation involving billions of variables. Thus, it is not necessary to parallelize Algorithm 3 in the implementation.\nTo integrate Algorithm 3 as the core step in Algorithm 1, there are two extra steps we need to perform before and after it. The first is to calculate the dot product matrix between the (2m + 1) base vectors. Because all base vectors have the same dimension d, we can partition them in the same way and use one map-reduce step to calculate the dot product matrix. This computation is highly parallelizable and intrinsically suitable for map-reduce. Even without considering parallelization, a first glance suggests it may require about 4m2 dot products. 
However, since all the si and yi except the most recent ones are unchanged in a new iteration, we can save the tiny dot product matrix and reuse most of its entries across iterations. Considering the commutative law of multiplication, since si \u00b7 yj \u2261 yj \u00b7 si, each new iteration only needs to calculate 6m new dot products, namely those involving the new sk, yk and gk. Thus, the complexity is only 6md, and this calculation is fully parallel in map-reduce, with each partition only calculating a small portion of the 6md multiplications.\nThe final step is to calculate the new direction p based on the \u03b4i and the base vectors. Its complexity is another 2md multiplications, which means the overall complexity of the algorithm is 8md multiplications. Since \u03b4 is just a tiny vector with 2m + 1 dimensions, we can join it with all the base vectors and then use the same approach as the dot product calculation to produce the final direction p using Eqn. 4. A single map-reduce step is sufficient for this final step. Altogether, without considering the gradient calculation, which is the same for all algorithms, VL-BFGS only requires 3 map-reduce steps per iteration.\nThe centralized update approach in Section 4.1 also requires 6md multiplications in each two-loop recursion. In addition to being centralized, as analyzed above, it requires (2m + 1) \u2217 d memory storage. This clearly limits its application to large-scale problems. In contrast, VL-BFGS in Algorithm 3 only requires (2m + 1)2 memory storage and is independent of d. The distributed approach in Section 4.2 requires at least 2m map-reduce steps per two-loop recursion. Given the number of iterations N (generally N > 100), the total number of map-reduce steps is 2mN, whereas VL-BFGS only requires 3N map-reduce steps. 
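For illustration, the scalar-only recursion of Algorithm 3 can be sketched as follows (a sketch under our own 0-based indexing; the (2m + 1) \u2217 (2m + 1) dot product matrix B is assumed to have been precomputed, e.g., by the single map-reduce step described above):

```python
def vl_bfgs_coefficients(B):
    """Vector-free two-loop recursion (sketch of Algorithm 3).

    B[i][j] holds b_i . b_j for the 2m+1 base vectors, with
    b_0..b_{m-1} the s vectors (oldest first), b_m..b_{2m-1} the y
    vectors, and b_{2m} the gradient. Returns the coefficients delta
    such that p = sum_i delta[i] * b_i. Only scalar arithmetic is used,
    so the cost is independent of the dimension d.
    """
    n = len(B)
    m = (n - 1) // 2
    delta = [0.0] * n
    delta[2 * m] = -1.0                          # p starts as -gradient
    alphas = [0.0] * m
    for j in reversed(range(m)):                 # first loop: newest to oldest
        # (s_j . p) / (s_j . y_j), with s_j . p expanded via the deltas.
        alphas[j] = sum(delta[l] * B[l][j] for l in range(n)) / B[j][m + j]
        delta[m + j] -= alphas[j]
    # Scaling by (s_{k-1} . y_{k-1}) / (y_{k-1} . y_{k-1}).
    scale = B[m - 1][2 * m - 1] / B[2 * m - 1][2 * m - 1]
    delta = [scale * dl for dl in delta]
    for j in range(m):                           # second loop: oldest to newest
        # (y_j . p) / (s_j . y_j), again expanded via the deltas.
        beta = sum(delta[l] * B[m + j][l] for l in range(n)) / B[j][m + j]
        delta[j] += alphas[j] - beta
    return delta
```

Only O(m2) scalar operations are performed; the final direction p is then recovered as the linear combination of the base vectors with these coefficients, per Eqn. 4.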
In summary, the VL-BFGS algorithm enjoys a similar overall complexity but comes with a massive degree of parallelism. For problems with billions of variables, it is the only map-reduce-friendly implementation of the three approaches.\n\n6 Experiment and Discussion\n\nAs demonstrated above, VL-BFGS has a better scalability property than the original L-BFGS. While it is always desirable to invent an exact algorithm that can be mathematically proven to have better scalability, it is also beneficial to demonstrate the value of a larger number of variables with an industrial application. Moreover, for a problem with billions of variables, there are existing practical approaches that reduce it to a smaller number of variables and then solve it with traditional centralized algorithms. In this section, we justify the value of learning with a large number of variables, compare against the hashing approach, and finally demonstrate the scalability advantage of VL-BFGS.\n\n6.1 Dataset and Experimental Setting\n\nThe dataset we use is from an Ads Click-Through Rate (CTR) prediction problem [1] collected from an industrial search engine. The click event (click or not) is used as the label for each instance. The features include the terms from a query and an Ad keyword along with contextual information such as Ad position, session-related information and time. We collect 30 days of data and split them into training and test sets chronologically: the data from the first 20 days are used as the training set and the remaining 10 days as the test set. The training data contain about 12 billion instances and the test data another 6 billion. There are 1,038,934,683 features, and the number of non-zero features per instance is about 100 on average. 
Altogether, there are about 2 trillion entries in the data matrix.\n\nTable 1: Relative AUC performance over different numbers of variables\nK | Relative AUC Performance\nBaseline (K=1,038,934,683) | 0.0%\nK=250 million | -0.1007388%\nK=100 million | -0.1902843%\nK=10 million | -0.3134094%\nK=1 million | -0.5701142%\n\nTable 2: Relative AUC performance over different numbers of hash bits\nK | Relative AUC Performance\nBaseline (K=1,038,934,683) | 0.0%\nK=64 million (26 bits) | -0.1063033%\nK=16 million (24 bits) | -0.2323647%\nK=4 million (22 bits) | -0.3300788%\nK=1 million (20 bits) | -0.5080904%\n\nWe run logistic regression training, so each feature corresponds to a variable. The model is evaluated on the test data using the Area Under the ROC Curve [19], denoted AUC. We set the historical state length m = 10 and apply an L1 regularizer [20] to avoid overfitting and achieve sparsity. The regularization parameter is tuned following the approach in [18].\nWe run the experiments in a shared cluster with tens of thousands of machines. Each machine runs up to 12 concurrent vertices; a vertex is generally a map or reduce step with an allocation of 2 cores and 6 GB of memory. More than 1000 different jobs run simultaneously, although this number varies significantly. We split the training data into 400 partitions and allocate 400 tokens for this job, which means the job can use up to 400 vertices at the same time. When we partition vectors to calculate their dot products, our strategy is to allocate up to 5 million entries per partial vector; for example, 1 billion variables are split evenly into 200 partitions.\nWe use the model trained with the original 1 billion features as the baseline, and all other experiments are compared against it. Since we are not allowed to exhibit the exact AUC numbers due to privacy considerations, we report the relative change compared with the baseline. 
The scale of the dataset makes any relative AUC change over 0.001% produce a p-value less than 0.01.\n\n6.2 Value of a Large Number of Variables\n\nTo reduce the number of variables in the original problem, we sort the features by their frequency in the training data. If we plan to reduce the problem to K variables, we keep the top K most frequent features. The baseline without filtering is equivalent to K = 1,038,934,683. We choose different values of K and report the relative AUC numbers in Table 1.\nThe table shows that as we reduce the number of variables, the results consistently and significantly decline. With 1 million variables, the drop is more than 0.5%, which is considerable for this problem. Even when we increase the number of variables to 250 million, the decline is still obvious and significant. This demonstrates that a large number of variables is really needed to learn a good model, and shows the value of learning with billions of variables.\n\n6.3 Comparison with Hashing\n\nWe follow the approach in [21][18] to calculate a new hash value for each original feature value based on a hash function from [18]. The number of hash bits ranges from 20 to 26. Experimental results compared with the baseline in terms of relative AUC performance are presented in Table 2.\nConsistent with the previous results, all the hashing experiments result in degradation. For the experiment with 20 bits, the degradation is 0.5%, a substantial decline for this problem. When we increase the number of bits to 26, the gap becomes smaller but is still noticeable. All of this consistently demonstrates that the hashing approach sacrifices noticeable performance; it is beneficial to train with a large number of raw features.\n\n6.4 Training Time Comparison\n\nWe compare the L-BFGS implementation of Section 4.1 with the proposed VL-BFGS. 
To support a larger number of variables in L-BFGS, we reduce the parameter m to 3. We conduct experiments with varying numbers of features and report the corresponding running times. We use the original data hashed into 1M features as the baseline, compare all other experiments with it, and report the relative training time for the same number of iterations. We run each experiment 5 times and report the mean to cope with the variance of each run. The results for hash bits ranging from 20 to 29, together with the original 1B features, are shown in Figure 1. When the number of features is less than 10M, the original L-BFGS has a small advantage over VL-BFGS. However, as we continue to increase the number of features, the running time of L-BFGS grows quickly while that of VL-BFGS increases slowly. Moreover, when we increase the number of features to 512M, L-BFGS fails with an out-of-memory exception, while VL-BFGS easily scales to 1B features. All of this clearly shows the scalability advantage of VL-BFGS over traditional L-BFGS.\n\nFigure 1: Training time over feature number.\n\n7 Conclusion\n\nWe have presented a new vector-free exact L-BFGS update procedure called VL-BFGS. As opposed to the original L-BFGS algorithm in map-reduce, the core two-loop recursion in VL-BFGS is independent of the number of variables. This enables it to be easily parallelized in map-reduce and to scale up to billions of variables. We prove its mathematical equivalence to the original L-BFGS, show its scalability advantage over traditional L-BFGS in map-reduce with a great degree of parallelism, and perform experiments demonstrating the value of large-scale learning with billions of variables using VL-BFGS. Although we emphasize the map-reduce implementation in this paper, VL-BFGS can be straightforwardly used by other distributed frameworks to avoid their centralized bottleneck and scale up their algorithms. 
In short, VL-BFGS is highly bene\ufb01cial for machine\nlearning algorithms relying on L-BFGS to scale up to another order of magnitude.\n\n8\n\n\fReferences\n[1] T. Graepel, J.Q. Candela, T. Borchert, and R. Herbrich. Web-Scale Bayesian Click-Through\nRate Prediction for Sponsored Search Advertising in Microsofts Bing Search Engine. In Inter-\nnational Conference on Machine Learning, pages 13\u201320. Citeseer, 2010.\n\n[2] Jeffrey Dean, G Corrado, Rajat Monga, Kai Chen, and Matthieu Devin. Large Scale Distributed\nDeep Networks. Advances in Neural Information Processing Systems 25, pages 1232\u20131240,\n2012.\n\n[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce : Simpli\ufb01ed Data Processing on Large Clus-\n\nters. Communications of the ACM, 51(1):1\u201313, 2008.\n\n[4] DC Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization.\n\nMathematical programming, 45(1-3):503\u2013528, 1989.\n\n[5] J Nocedal and S J Wright. Numerical Optimization, volume 43 of Springer Series in Opera-\n\ntions Research. Springer, 1999.\n\n[6] C Zhu, RH Byrd, P Lu, and J Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for\nlarge-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23,\npages 550\u2013560, 1997.\n\n[7] Stephen G. Nash and Ariela Sofer. Block truncated-Newton methods for parallel optimization.\n\nMathematical Programming, 45(1-3):529\u2013546, 1989.\n\n[8] Jorge Nocedal. Updating quasi-Newton matrices with limited storage, 1980.\n[9] DF Shanno. On broyden-\ufb02etcher-goldfarb-shanno method. Journal of Optimization Theory\n\nand Applications, 1985.\n\n[10] N Schraudolph, J Yu, and S G\u00a8unter. A stochastic quasi-Newton method for online convex\n\noptimization. Journal of Machine Learning Research, pages 436\u2013443, 2007.\n\n[11] H Daum\u00b4e III. Notes on CG and LM-BFGS optimization of logistic regression. 2004.\n[12] Y Low, J Gonzalez, and A Kyrola. 
Graphlab: A new framework for parallel machine learning.\n\nUncertainty in Arti\ufb01cial Intelligence, 2010.\n\n[13] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng,\nand Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In Advances in Neural\nInformation Processing Systems 19, pages 281\u2013288. MIT Press, 2007.\n\n[14] S Boyd, N Parikh, and E Chu. Distributed optimization and statistical learning via the al-\nternating direction method of multipliers. Foundations and Trends in in Machine Learning,\n(3):1\u2013122, 2011.\n\n[15] J Langford, AJ Smola, and M Zinkevich. Slow learners are fast. Advances in Neural Informa-\n\ntion Processing Systems 22, pages 2331\u20132339, 2009.\n\n[16] C Teo, Le.Q, A Smola, and SVN Vishwanathan. A scalable modular convex solver for regular-\nized risk minimization. ACM SIGKDD Conference on Knowledge Discovery and Data Mining,\n2007.\n\n[17] S Gopal and Y Yang. Distributed training of Large-scale Logistic models. Proceedings of the\n\n30th International Conference on Machine Learning, 28:287\u2013297, 2013.\n\n[18] Alekh Agarwal, Oliveier Chapelle, Miroslav Dud\u00b4\u0131k, and John Langford. A Reliable Effective\nTerascale Linear Learning System. Journal of Machine Learning Research, 15:1111\u20131133,\n2014.\n\n[19] CX Ling, J Huang, and H Zhang. AUC: a statistically consistent and more discriminating\n\nmeasure than accuracy. IJCAI, pages 329\u2013341, 2003.\n\n[20] Galen Andrew and Jianfeng Gao. Scalable training of l1-regularized log-linear models. Pro-\n\nceedings of the 24th International Conference on Machine Learning, pages 33\u201340, 2007.\n\n[21] K Weinberger, A Dasgupta, J Langford, Smola.A, and J Attenberg. Feature hashing for large\n\nscale multitask learning. 
International Conference on Machine Learning, 2009.\n\n9\n\n\f", "award": [], "sourceid": 748, "authors": [{"given_name": "Weizhu", "family_name": "Chen", "institution": "Microsoft"}, {"given_name": "Zhenghao", "family_name": "Wang", "institution": "Microsoft"}, {"given_name": "Jingren", "family_name": "Zhou", "institution": "Microsoft"}]}