{"title": "Effective Parallelisation for Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6477, "page_last": 6488, "abstract": "We present a novel parallelisation scheme that simplifies the adaptation of learning algorithms to growing amounts of data as well as growing needs for accurate and confident predictions in critical applications. In contrast to other parallelisation techniques, it can be applied to a broad class of learning algorithms without further mathematical derivations and without writing dedicated code, while at the same time maintaining theoretical performance guarantees. Moreover, our parallelisation scheme is able to reduce the runtime of many learning algorithms to polylogarithmic time on quasi-polynomially many processing units. This is a significant step towards a general answer to an open question on efficient parallelisation of machine learning algorithms in the sense of Nick's Class (NC). The cost of this parallelisation is in the form of a larger sample complexity. Our empirical study confirms the potential of our parallelisation scheme with fixed numbers of processors and instances in realistic application scenarios.", "full_text": "Effective Parallelisation for Machine Learning\n\nMichael Kamp\nUniversity of Bonn\nand Fraunhofer IAIS\n\nkamp@cs.uni-bonn.de\n\nOlana Missura\n\nGoogle Inc.\n\nolanam@google.com\n\nMario Boley\n\nMax Planck Institute for Informatics\n\nand Saarland University\n\nmboley@mpi-inf.mpg.de\n\nThomas G\u00a8artner\n\nUniversity of Nottingham\n\nthomas.gaertner@nottingham.ac.uk\n\nAbstract\n\nWe present a novel parallelisation scheme that simpli\ufb01es the adaptation of learn-\ning algorithms to growing amounts of data as well as growing needs for accurate\nand con\ufb01dent predictions in critical applications. 
In contrast to other parallelisation techniques, it can be applied to a broad class of learning algorithms without further mathematical derivations and without writing dedicated code, while at the same time maintaining theoretical performance guarantees. Moreover, our parallelisation scheme is able to reduce the runtime of many learning algorithms to polylogarithmic time on quasi-polynomially many processing units. This is a significant step towards a general answer to an open question on the efficient parallelisation of machine learning algorithms in the sense of Nick's Class (NC). The cost of this parallelisation is in the form of a larger sample complexity. Our empirical study confirms the potential of our parallelisation scheme with fixed numbers of processors and instances in realistic application scenarios.

1 Introduction

This paper contributes a novel and provably effective parallelisation scheme for a broad class of learning algorithms. The significance of this result is to allow the confident application of machine learning algorithms with growing amounts of data. In critical application scenarios, i.e., when errors have almost prohibitively high cost, this confidence is essential [27, 36]. To this end, we consider the parallelisation of an algorithm to be effective if it achieves the same confidence and error bounds as the sequential execution of that algorithm in much shorter time. Indeed, our parallelisation scheme can reduce the runtime of learning algorithms from polynomial to polylogarithmic. For that, it consumes more data and is executed on a quasi-polynomial number of processing units.
To formally describe and analyse our parallelisation scheme, we consider the regularised risk minimisation setting.
For a fixed but unknown joint probability distribution D over an input space X and an output space Y, a dataset D ⊆ X × Y of size N ∈ N drawn iid from D, a convex hypothesis space F of functions f : X → Y, a loss function ℓ : F × X × Y → R that is convex in F, and a convex regularisation term Ω : F → R, regularised risk minimisation algorithms solve

L(D) = argmin_{f ∈ F} Σ_{(x,y) ∈ D} ℓ(f, x, y) + Ω(f) .   (1)

The aim of this approach is to obtain a hypothesis f ∈ F with small regret

Q(f) = E[ℓ(f, x, y)] − min_{f′ ∈ F} E[ℓ(f′, x, y)] .   (2)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Regularised risk minimisation algorithms are typically designed to be consistent and efficient. They are consistent if there is a function N_0 : R_+ × R_+ → R_+ such that for all ε > 0, ∆ ∈ (0, 1], N ∈ N with N ≥ N_0(ε, ∆), and training data D ∼ D^N, the probability of generating an ε-bad hypothesis is no greater than ∆, i.e.,

P(Q(L(D)) > ε) ≤ ∆ .   (3)

They are efficient if the sample complexity N_0(ε, ∆) is polynomial in 1/ε and log 1/∆ and the runtime complexity T_L is polynomial in the sample complexity. This paper considers the parallelisation of such consistent and efficient learning algorithms, e.g., support vector machines, regularised least squares regression, and logistic regression. We additionally assume that data is abundant and that F can be parametrised in a fixed, finite-dimensional Euclidean space R^d such that the convexity of the regularised risk minimisation problem (Equation 1) is preserved. In other cases, (non-linear) low-dimensional embeddings [2, 28] can preprocess the data to facilitate parallel learning with our scheme.
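To make Equation (1) concrete, a minimal sketch (our illustration, not code from the paper) for the squared loss ℓ(f, x, y) = (⟨f, x⟩ − y)² with regulariser Ω(f) = λ‖f‖²; for this choice the minimiser has the familiar ridge-regression closed form:

```python
import numpy as np

def regularised_risk_minimiser(X, y, lam=1.0):
    """Equation (1) for squared loss and an L2 regulariser:
    argmin_f sum_i (<f, x_i> - y_i)^2 + lam * ||f||^2,
    solved in closed form via (X^T X + lam * I) f = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# tiny demo on synthetic data (all names here are illustrative)
rng = np.random.default_rng(0)
f_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(1000, 3))
y = X @ f_true + 0.1 * rng.normal(size=1000)
f_hat = regularised_risk_minimiser(X, y, lam=1e-3)
```

Any other convex loss (hinge, logistic) fits the same template, only without a closed-form solution.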
With slight abuse of notation, we identify the hypothesis space with its parametrisation.
The main theoretical contribution of this paper is to show that algorithms satisfying the above conditions can be parallelised effectively. We consider a parallelisation to be effective if the (ε, ∆)-guarantees (Equation 3) are achieved in time polylogarithmic in N_0(ε, ∆). The cost for achieving this reduction in runtime comes in the form of an increased data size and in the number of processing units used. For the parallelisation scheme presented in this paper, we are able to bound this cost by a quasi-polynomial in 1/ε and log 1/∆. The main practical contribution of this paper is an effective parallelisation scheme that treats the underlying learning algorithm as a black box, i.e., it can be parallelised without further mathematical derivations and without writing dedicated code.
Similar to averaging-based parallelisations [32, 45, 46], we apply the underlying learning algorithm in parallel to random subsets of the data. Each resulting hypothesis is assigned to a leaf of an aggregation tree which is then traversed bottom-up. Each inner node computes a new hypothesis that is a Radon point [30] of its children's hypotheses. In contrast to aggregation by averaging, the Radon point increases the confidence in the aggregate doubly exponentially with the height of the aggregation tree. We describe our parallelisation scheme, the Radon machine, in detail in Section 2. Comparing the Radon machine to the underlying learning algorithm which is applied to a dataset of the size necessary to achieve the same confidence, we are able to show a reduction in runtime from polynomial to polylogarithmic in Section 3.
The empirical evaluation of the Radon machine in Section 4 confirms its potential in practical settings.
Given the same amount of data as the underlying learning algorithm, the Radon machine achieves a substantial reduction of computation time in realistic applications. Using 150 processors, the Radon machine is between 80 and around 700 times faster than the underlying learning algorithm on a single processing unit. Compared with parallel learning algorithms from Spark's MLlib, it achieves hypotheses of similar quality, while requiring only 15-85% of their runtime.
Parallel computing [18] and its limitations [13] have been studied for a long time in theoretical computer science [7]. Parallelising polynomial time algorithms ranges from being 'embarrassingly' [26] easy to being believed to be impossible. For the class of decision problems that are the hardest in P, i.e., for P-complete problems, it is believed that there is no efficient parallel algorithm in the sense of Nick's Class (NC [9]): efficient parallel algorithms in this sense are those that can be executed in polylogarithmic time on a polynomial number of processing units. Our paper thus contributes to understanding the extent to which efficient parallelisation of polynomial time learning algorithms is possible.

Algorithm 1 Radon Machine
Input: learning algorithm L, dataset D ⊆ X × Y, Radon number r ∈ N, and parameter h ∈ N
Output: hypothesis f ∈ F
1: divide D into r^h iid subsets D_i of roughly equal size
2: run L in parallel to obtain f_i = L(D_i)
3: S ← {f_1, . . . , f_{r^h}}
4: for i = h − 1, . . . , 1 do
5:   partition S into iid subsets S_1, . . . , S_{r^i} of size r each
6:   calculate Radon points r(S_1), . . . , r(S_{r^i}) in parallel   # see Definition 1 and Appendix C.1
7:   S ← {r(S_1), . . . , r(S_{r^i})}
8: end for
9: return r(S)
This connection and other approaches to parallel learning are discussed in Section 5.

2 From Radon Points to Radon Machines

The Radon machine, described in Algorithm 1, first executes the underlying (base) learning algorithm on random subsets of the data to quickly achieve weak hypotheses and then iteratively aggregates them to stronger ones. Both the generation of weak hypotheses and the aggregation can be executed in parallel. To aggregate hypotheses, we follow along the lines of the iterated Radon point algorithm which was originally devised to approximate the centre point, i.e., a point of largest Tukey depth [38], of a finite set of points [8]. The Radon point [30] of a set of points is defined as follows:
Definition 1. A Radon partition of a set S ⊂ F is a pair A, B ⊂ S such that A ∩ B = ∅ but ⟨A⟩ ∩ ⟨B⟩ ≠ ∅, where ⟨·⟩ denotes the convex hull. The Radon number of a space F is the smallest r ∈ N such that for all S ⊂ F with |S| ≥ r there is a Radon partition; or ∞ if no S ⊂ F with Radon partition exists. A Radon point of a set S with Radon partition A, B is any r ∈ ⟨A⟩ ∩ ⟨B⟩.
We now present the Radon machine (Algorithm 1), which is able to effectively parallelise consistent and efficient learning algorithms. Input to this parallelisation scheme is a learning algorithm L on a hypothesis space F, a dataset D ⊆ X × Y, the Radon number r ∈ N of the hypothesis space F, and a parameter h ∈ N. It divides the dataset into r^h subsets D_1, . . . , D_{r^h} (line 1) and runs the algorithm L on each subset in parallel (line 2). Then, the set of hypotheses (line 3) is iteratively aggregated to form better sets of hypotheses (lines 4-8).
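A minimal, sequential sketch of Algorithm 1 (our illustration, not the authors' implementation), assuming a Euclidean hypothesis space so that r = d + 2 by Radon's theorem, with plain least squares as a stand-in base learner; the Radon point is computed from a null vector of the homogeneous system underlying Radon's theorem:

```python
import numpy as np

def radon_point(points):
    """Radon point of r = d + 2 points in R^d: find a non-zero lambda with
    sum_i lambda_i * x_i = 0 and sum_i lambda_i = 0, then take the normalised
    average of the points with positive coefficients; this point lies in the
    convex hulls of both parts of the induced Radon partition."""
    P = np.asarray(points, dtype=float)      # shape (r, d) with r = d + 2
    A = np.vstack([P.T, np.ones(len(P))])    # (d + 1) x r homogeneous system
    lam = np.linalg.svd(A)[2][-1]            # a null-space vector of A
    pos = lam > 0
    return (lam[pos] @ P[pos]) / lam[pos].sum()

def radon_machine(base_learner, X, y, h, rng):
    """Sketch of Algorithm 1, simulating the parallel steps sequentially."""
    r = X.shape[1] + 2                       # Radon number of R^d
    parts = np.array_split(rng.permutation(len(y)), r ** h)
    S = [base_learner(X[i], y[i]) for i in parts]               # lines 1-3
    for _ in range(h):                                          # lines 4-8
        S = [radon_point(S[j:j + r]) for j in range(0, len(S), r)]
    return S[0]                                                 # line 9

# demo with a hypothetical base learner: unregularised least squares in R^3
rng = np.random.default_rng(1)
f_true = np.array([1.0, -2.0, 3.0])
X = rng.normal(size=(20000, 3))
y = X @ f_true + rng.normal(size=20000)
base = lambda Xi, yi: np.linalg.lstsq(Xi, yi, rcond=None)[0]
f_hat = radon_machine(base, X, y, h=2, rng=rng)
```

The two list comprehensions correspond to the parallel steps (lines 2 and 6 of Algorithm 1); here they are executed one after another on a single process.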
For that, the set is partitioned into subsets of size r (line 5) and the Radon point of each subset is calculated in parallel (line 6). The final step of each iteration is to replace the set of hypotheses by the set of Radon points (line 7).
The scheme requires a hypothesis space with a valid notion of convexity and finite Radon number. While other notions of convexity are possible [16, 33], in this paper we restrict our consideration to Euclidean spaces with the usual notion of convexity. Radon's theorem [30] states that the Euclidean space R^d has Radon number r = d + 2. Radon points can then be obtained by solving a system of linear equations of size r × r (to be fully self-contained we state the system of linear equations explicitly in Appendix C.1). The next proposition gives a guarantee on the quality of Radon points:
Proposition 2. Given a probability measure P over a hypothesis space F with finite Radon number r, let F denote a random variable with distribution P. Furthermore, let r be the random variable obtained by computing the Radon point of r random points drawn according to P^r. Then it holds for the expected regret Q and all ε ∈ R that

P(Q(r) > ε) ≤ (r P(Q(F) > ε))^2 .

The proof of Proposition 2 is provided in Section 7. Note that this proof also shows the robustness of the Radon point compared to the average: if only one of r points is ε-bad, the Radon point is still ε-good, while the average may or may not be; indeed, in a linear space with any set of ε-good hypotheses and any ε′ ≥ ε, we can always find a single ε′-bad hypothesis such that the average of all these hypotheses is ε′-bad.
A direct consequence of Proposition 2 is a bound on the probability that the output of the Radon machine with parameter h is bad:
Theorem 3.
Given a probability measure P over a hypothesis space F with finite Radon number r, let F denote a random variable with distribution P. Denote by r_1 the random variable obtained by computing the Radon point of r random points drawn iid according to P and by P_1 its distribution. For any h ∈ N, let r_h denote the Radon point of r random points drawn iid from P_{h−1} and by P_h its distribution. Then for any convex function Q : F → R and all ε ∈ R it holds that

P(Q(r_h) > ε) ≤ (r P(Q(F) > ε))^{2^h} .

The proof of Theorem 3 is also provided in Section 7. For the Radon machine with parameter h, Theorem 3 shows that the probability of obtaining an ε-bad hypothesis is doubly exponentially reduced: with a bound δ on this probability for the base learning algorithm, the bound ∆ on this probability for the Radon machine is

∆ = (rδ)^{2^h} .   (4)

In the next section we compare the Radon machine to its base learning algorithm which is applied to a dataset of the size necessary to achieve the same ε and ∆.

3 Sample and Runtime Complexity

In this section we first derive the sample and runtime complexity of the Radon machine R from the sample and runtime complexity of the base learning algorithm L. We then relate the runtime complexity of the Radon machine to an application of the base learning algorithm which achieves the same (ε, ∆)-guarantee. For that, we consider consistent and efficient base learning algorithms with a sample complexity of the form N_0^L(ε, δ) = (α_ε + β_ε ld 1/δ)^k, for some¹ α_ε, β_ε ∈ R, and k ∈ N.
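To see the doubly exponential reduction of Equation (4) numerically, a quick check (r and δ chosen purely for illustration, e.g. r = 8 for hypotheses in R^6 and δ = 1/32, so that rδ < 1/2):

```python
# Delta = (r * delta)^(2^h), Equation (4): the failure probability of the
# Radon machine as a function of the aggregation height h.
r = 8                 # Radon number of R^6 (r = d + 2); illustrative choice
delta = 1 / 32        # assumed failure bound of a single base hypothesis
for h in range(5):
    Delta = (r * delta) ** (2 ** h)
    print(h, Delta)   # 0.25, 0.0625, 0.00390625, ~1.5e-05, ~2.3e-10
```

Four levels of aggregation already push a quarter chance of failure below one in a billion, which is the effect the runtime analysis below trades against extra samples and processors.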
From now on, we also assume that δ ≤ 1/(2r) for the base learning algorithm.
The Radon machine creates r^h base hypotheses and, with ∆ as in Equation 4, has sample complexity

N_0^R(ε, ∆) = r^h N_0^L(ε, δ) = r^h · (α_ε + β_ε ld 1/δ)^k .   (5)

Theorem 3 then implies that the Radon machine with base learning algorithm L is consistent: with N ≥ N_0^R(ε, ∆) samples it achieves an (ε, ∆)-guarantee.
To achieve the same guarantee as the Radon machine, the application of the base learning algorithm L itself (sequentially) would require M ≥ N_0^L(ε, ∆) samples, where

N_0^L(ε, ∆) = N_0^L(ε, (rδ)^{2^h}) = (α_ε + 2^h · β_ε ld 1/(rδ))^k .   (6)

For base learning algorithms L with runtime T_L(n) polynomial in the data size n ∈ N, i.e., T_L(n) ∈ O(n^κ) with κ ∈ N, we now determine the runtime T_{R,h}(N) of the Radon machine with h iterations and c = r^h processing units on N ∈ N samples. In this case all base learning algorithms can be executed in parallel. In practical applications fewer physical processors can be used to simulate r^h processing units; we discuss this case in Section 5.
The runtime of the Radon machine can be decomposed into the runtime of the base learning algorithm and the runtime for the aggregation. The base learning algorithm requires n ≥ N_0^L(ε, δ) samples and can be executed on r^h processors in parallel in time T_L(n). The Radon point in each of the h iterations can then be calculated in parallel in time r^3 (see Appendix C.1).
Thus, the runtime of the Radon machine with N = r^h n samples is

T_{R,h}(N) = T_L(n) + h r^3 .   (7)

In contrast, the runtime of the base learning algorithm for achieving the same guarantee is T_L(M) with M ≥ N_0^L(ε, ∆). Ignoring logarithmic and constant terms, N_0^L(ε, ∆) behaves as 2^h N_0^L(ε, δ). To obtain polylogarithmic runtime of R compared to T_L(M), we choose the parameter h ≈ ld M − ld ld M such that n ≈ M/2^h = ld M. Thus, the runtime of the Radon machine is in O(ld^κ M + r^3 ld M). This result is formally summarised in Theorem 4.

Theorem 4. The Radon machine with a consistent and efficient regularised risk minimisation algorithm on a hypothesis space with finite Radon number has polylogarithmic runtime on quasi-polynomially many processing units if the Radon number can be upper bounded by a function polylogarithmic in the sample complexity of the efficient regularised risk minimisation algorithm.

The theorem is proven in Appendix A.1 and relates to Nick's Class [1]: a decision problem can be solved efficiently in parallel in the sense of Nick's Class if it can be decided by an algorithm in polylogarithmic time on polynomially many processors (assuming, e.g., the PRAM model). For the class of decision problems that are the hardest in P, i.e., for P-complete problems, it is believed that there is no efficient parallel algorithm for solving them in this sense. Theorem 4 provides a step towards finding efficient parallelisations of regularised risk minimisers and towards answering the open question: is consistent regularised risk minimisation possible in polylogarithmic time on polynomially many processors?
A similar question, for the case of learning half-spaces, has been called a fundamental open problem by Long and Servedio [21], who gave an algorithm which runs on polynomially many processors in time that depends polylogarithmically on the sample size but is inversely proportional to a parameter of the learning problem. While Nick's Class as a notion of efficiency has been criticised [17], it is the only notion of efficiency that forms a proper complexity class in the sense of Blum [4]. To overcome the weakness of using only this notion, Kruskal et al. [17] suggested to consider also the inefficiency of simulating the parallel algorithm on a single processing unit. We discuss the inefficiency and the speed-up in Appendix A.2.

¹We derive α_ε, β_ε for hypothesis spaces with finite VC [41] and Rademacher [3] complexity in App. C.2.

4 Empirical Evaluation

This empirical study compares the Radon machine to state-of-the-art parallel machine learning algorithms from the Spark machine learning library [25], as well as the natural baseline of averaging hypotheses instead of calculating their Radon point (averaging-at-the-end, Avg). We use base learning algorithms from WEKA [44] and scikit-learn [29]. We compare the Radon machine to the base learning algorithms on moderately sized datasets, due to scalability limitations of the base learners, and reserve larger datasets for the comparison with parallel learners. The experiments are executed on a Spark cluster (5 worker nodes, 25 processors per node)². All results are obtained using 10-fold cross validation. We apply the Radon machine with parameter h = 1 and the maximal parameter h such that each instance of the base learning algorithm is executed on a subset of size at least 100 (denoted h = max).
Averaging-at-the-end executes the base learning algorithm on the same number of subsets r^h as the Radon machine with that parameter and is denoted in the figures by stating the parameter h as for the Radon machine. All other parameters of the learning algorithms are optimised on an independent split of the datasets. See Appendix B for additional details.
What is the speed-up of our scheme in practice? In Figure 1(a), we compare the Radon machine to its base learners on moderately sized datasets (details on the datasets are provided in Appendix B).

²The source code implementation in Spark can be found in the bitbucket repository https://bitbucket.org/Michael_Kamp/radonmachine.

Figure 1: (a) Runtime (log-scale) and AUC of base learners and their parallelisation using the Radon machine (PRM) for 6 datasets with N ∈ [488 565, 5 000 000], d ∈ [3, 18]. Each point represents the average runtime (upper part) and AUC (lower part) over 10 folds of a learner, or its parallelisation, on one dataset.
(b) Runtime and AUC of the Radon machine compared to the averaging-at-the-end baseline (Avg) on 5 datasets with N ∈ [5 000 000, 32 000 000], d ∈ [18, 2 331]. (c) Runtime and AUC of several Spark machine learning library algorithms and the Radon machine using base learners that are comparable to the Spark algorithms on the same datasets as in Figure 1(b).

Figure 2: Speed-up (log-scale) of the Radon machine over its base learners per dataset from the same experiment as in Figure 1(a).

Figure 3: Dependence of the runtime on the dataset size of the Radon machine compared to its base learners.

There, the Radon machine is between 80 and around 700 times faster than the base learner using 150 processors. The speed-up is detailed in Figure 2. On the SUSY dataset (with 5 000 000 instances and 18 features), the Radon machine on 150 processors with h = 3 is 721 times faster than its base learning algorithms. At the same time, their predictive performances, measured by the area under the ROC curve (AUC) on an independent test dataset, are comparable.
How does the scheme compare to averaging-at-the-end?
In Figure 1(b) we compare the runtime and AUC of the parallelisation scheme against the averaging-at-the-end baseline (Avg). In terms of the AUC, the Radon machine outperforms the averaging-at-the-end baseline on all datasets by at least 10%. The runtimes can hardly be distinguished in that figure. A small difference can however be noted in Figure 4, which is discussed in more detail in the next paragraph. Since averaging is less computationally expensive than calculating the Radon point, the runtimes of the averaging-at-the-end baselines are slightly lower than the ones of the Radon machine. However, compared to the computational complexity of executing the base learner, this advantage becomes negligible.
How does our scheme compare to state-of-the-art Spark machine learning algorithms? We compare the Radon machine to various Spark machine learning algorithms on 5 large datasets. The results in Figure 1(c) indicate that the proposed parallelisation scheme with h = max has a substantially smaller runtime than the Spark algorithms on all datasets. On the SUSY and HIGGS datasets, the Radon machine is one order of magnitude faster than the Spark implementations; here the comparatively small number of features allows for a high level of parallelism. On the CASP9 dataset, the Radon machine is 15% faster than the fastest Spark algorithm. The performance in terms of AUC of the Radon machine is similar to the Spark algorithms. In particular, when using WekaLogReg with h = max, the Radon machine outperforms the Spark algorithms in terms of AUC and runtime on the datasets SUSY, wikidata, and CASP9. Details are given in Appendix B. A summarising comparison of the parallel approaches in terms of their trade-off between runtime and predictive performance is depicted in Figure 4.
Here, results are shown for the Radon machine and averaging-at-the-end with parameter h = max and for the two Spark algorithms most similar to the base learning algorithms.

Figure 4: Representation of the results in Figures 1(b) and 1(c) in terms of the trade-off between runtime and AUC for the Radon machine (PRM) and averaging-at-the-end (Avg), both with parameter h = max, and parallel machine learning algorithms in Spark. The dashed lines connect the Radon machine to averaging-at-the-end with the same base learning algorithm and a comparable Spark machine learning algorithm.

Note that it is unclear what caused the consistently weak performance of all algorithms on wikidata. Nonetheless, the results show that on all datasets the Radon machine has comparable predictive performance to the Spark algorithms and substantially higher predictive performance than averaging-at-the-end. At the same time, the Radon machine has a runtime comparable to averaging-at-the-end on all datasets and both are substantially faster than the Spark algorithms.
How does the runtime depend on the dataset size in a real-world system? The runtime of the Radon machine can be distinguished into its learning phase and its aggregation phase. While the learning phase fully benefits from parallelisation, this comes at the cost of additional runtime for the aggregation phase. The time for aggregating the hypotheses does not depend on the number of instances in the dataset but, for a fixed parameter h, only on the dimension of the hypothesis space and that parameter.
In Figure 3 we compare the runtimes of all base learning algorithms per dataset size to the Radon machine. Results indicate that, while the runtimes of the base learning algorithms depend on the dataset size with an average exponent of 1.57, the runtime of the Radon machine depends on the dataset size with an exponent of only 1.17.
How generally applicable is the scheme? As an indication of the general applicability in practice, we also consider regression and multi-class classification. For regression, we apply the scheme to the scikit-learn implementation of regularised least squares regression [29]. On the dataset YearPredictionMSD, regularised least squares regression achieves an RMSE of 12.57, whereas the Radon machine achieves an RMSE of 13.64. At the same time, the Radon machine is 197 times faster. We also compare the Radon machine on a multi-class prediction problem using conditional maximum entropy models. For multi-class classification, we use the implementation described in Mcdonald et al. [23], who propose to use averaging-at-the-end for distributed training. We compare the Radon machine to averaging-at-the-end with conditional maximum entropy models on two large multi-class datasets (drift and spoken-arabic-digit). On average, our scheme performs better with only slightly longer runtime. The minimal difference in runtime can be explained, similar to the results in Figure 1(b), by the smaller complexity of calculating the average instead of the Radon point.

5 Discussion and Future Work

In the experiments we considered datasets where the number of dimensions is much smaller than the number of instances. What about high-dimensional models? The basic version of the parallelisation scheme presented in this paper cannot directly be applied to cases in which the size of the dataset is not at least a multiple of the Radon number of the hypothesis space.
For various types of data, such as text, this might cause concerns. However, random projections [15] or low-rank approximations [2, 28] can alleviate this problem and are already frequently employed in machine learning. An alternative might be to combine our parallelisation scheme with block coordinate descent [37]. In this case, the scheme can be applied iteratively to subsets of the features.
In the experiments we considered only linear models. What about non-linear models? Learning non-linear models causes similar problems to learning high-dimensional ones. In non-parametric methods like kernel methods, for instance, the dimensionality of the optimisation problem is equal to the number of instances, thus prohibiting the application of our parallelisation scheme. However, similar low-rank approximation techniques as described above can be applied with non-linear kernels [11]. Furthermore, methods for speeding up the learning process for non-linear models explicitly approximate an embedding in which a linear model can then be learned [31]. Using explicitly constructed feature spaces, Radon machines can directly be applied to non-linear models.
We have theoretically analysed our parallelisation scheme for the case that there are enough processing units available to find each weak hypothesis on a separate processing unit. What if there are not r^h, but only c < r^h processing units? The parallelisation scheme can quite naturally be "de-parallelised" and partially executed in sequence. For the runtime this implies an additional factor of max{1, r^h/c}. Thus, the Radon machine can be applied with any number of processing units.
The scheme improves ∆ doubly exponentially in its parameter h, but for that it requires the weak hypotheses to already achieve δ ≤ 1/(2r).
Is the scheme only applicable in high-confidence domains? Many application scenarios require high-confidence error bounds, e.g., in the medical domain [27] or in intrusion detection [36]. In practice our scheme achieves similar predictive quality much faster than its base learner.
Besides runtime, communication plays an essential role in parallel learning. What is the communication complexity of the scheme? As for all aggregation-at-the-end strategies, the overall amount of communication is low compared to periodically communicating schemes. For the parallel aggregation of hypotheses, the scheme requires O(r^{h+1}) messages (which can be sent in parallel), each containing a single hypothesis of size O(r). Our scheme is ideally suited for inherently distributed data and might even mitigate privacy concerns.
In a lot of applications data is available in the form of potentially infinite data streams. Can the scheme be applied to distributed data streams? For each data stream, a hypothesis could be maintained using an online learning algorithm and periodically aggregated using the Radon machine, similar to the federated learning approach proposed by McMahan et al. [24].
In this paper, we investigated the parallelisation of machine learning algorithms. Is the Radon machine more generally applicable? The parallelisation scheme could be applied to more general randomised convex optimisation algorithms with unknown and random target functions. We will investigate its applicability for learning in non-Euclidean, abstract convexity spaces.

6 Conclusion and Related Work

In this paper we provided a step towards answering an open problem: Is parallel machine learning possible in polylogarithmic time using a polynomial number of processors only?
This question has been posed for half-spaces by Long and Servedio [21] and called "a fundamental open problem about the abilities and limitations of efficient parallel learning algorithms". It relates machine learning to Nick's Class of parallelisable decision problems and its variants [13]. Early theoretical treatments of parallel learning with respect to NC considered probably approximately correct (PAC) concept learning [5, 39]. Vitter and Lin [42] introduced the notion of NC-learnable for concept classes for which there is an algorithm that outputs a probably approximately correct hypothesis in polylogarithmic time using a polynomial number of processors. In this setting, they proved positive and negative learnability results for a number of concept classes that were previously known to be PAC-learnable in polynomial time. More recently, the special case of learning half-spaces in parallel was considered by Long and Servedio [21], who gave an algorithm for this case that runs on polynomially many processors in time that depends polylogarithmically on the size of the instances but is inversely proportional to a parameter of the learning problem. Our paper complements these theoretical treatments of parallel machine learning and provides a provably effective parallelisation scheme for a broad class of regularised risk minimisation algorithms.
Some parallelisation schemes also train learning algorithms on small chunks of data and average the found hypotheses. While this approach has advantages [12, 32], current error bounds do not allow a derivation of polylogarithmic runtime [20, 35, 45], and it has been doubted to have any benefit over learning on a single chunk [34]. Another popular class of parallel learning algorithms is based on stochastic gradient descent, targeting expected risk minimisation directly [34, and references therein].
The best algorithm known so far in this class [34] is the distributed mini-batch algorithm [10]. This algorithm still runs for a number of rounds inversely proportional to the desired optimisation error, hence not in polylogarithmic time. A more traditional approach is to minimise the empirical risk, i.e., an empirical sample-based approximation of the expected risk, using any deterministic or randomised optimisation algorithm. This approach relies on generalisation guarantees relating the expected and empirical risk minimisation, as well as on a guarantee on the optimisation error introduced by the optimisation algorithm. The approach is readily parallelisable by employing available parallel optimisation algorithms [e.g., 6]. It is worth noting that these algorithms solve a harder-than-necessary optimisation problem and often come with prohibitively high communication cost in distributed settings [34]. Recent results improve over these [22] but cannot achieve polylogarithmic time, as the number of iterations depends linearly on the number of processors.
Apart from its theoretical advantages, the Radon machine also has several practical benefits. In particular, it is a black-box parallelisation scheme in the sense that it is applicable to a wide range of machine learning algorithms and does not depend on the implementation of these algorithms. It speeds up learning while achieving a hypothesis quality similar to that of the base learner. Our empirical evaluation indicates that in practice the Radon machine achieves either a substantial speed-up or a higher predictive performance than other parallel machine learning algorithms.

7 Proof of Proposition 2 and Theorem 3

In order to prove Proposition 2 and subsequently Theorem 3, we first investigate some properties of Radon points and convex functions. We prove these properties for the more general case of quasi-convex functions.
Since every convex function is also quasi-convex, the results hold for convex functions as well. A quasi-convex function is defined as follows.
Definition 5. A function Q : F → ℝ is called quasi-convex if all its sublevel sets are convex, i.e.,

∀θ ∈ ℝ : {f ∈ F | Q(f) < θ} is convex.

First we give a different characterisation of quasi-convex functions.
Proposition 6. A function Q : F → ℝ is quasi-convex if and only if for all S ⊆ F and all s′ ∈ ⟨S⟩ there exists an s ∈ S with Q(s) ≥ Q(s′).

Proof.

(⇒) Suppose this direction does not hold. Then there is a quasi-convex function Q, a set S ⊆ F, and an s′ ∈ ⟨S⟩ such that for all s ∈ S it holds that Q(s) < Q(s′) (therefore s′ ∉ S). Let C = {c ∈ F | Q(c) < Q(s′)}. As S ⊆ C = ⟨C⟩ we also have that ⟨S⟩ ⊆ ⟨C⟩, which contradicts ⟨S⟩ ∋ s′ ∉ C.

(⇐) Suppose this direction does not hold. Then there exists an ε such that S = {s ∈ F | Q(s) < ε} is not convex and therefore there is an s′ ∈ ⟨S⟩ \ S. By assumption ∃s ∈ S : Q(s) ≥ Q(s′). Hence Q(s′) < ε and we have a contradiction, since this would imply s′ ∈ S.

The next proposition concerns the value of any quasi-convex function at a Radon point.
Proposition 7. For every set S with Radon point r and every quasi-convex function Q it holds that |{s ∈ S | Q(s) ≥ Q(r)}| ≥ 2.

Proof. We show a slightly stronger result: take any family of pairwise disjoint sets A_i with ⋂_i ⟨A_i⟩ ≠ ∅ and r ∈ ⋂_i ⟨A_i⟩. From Proposition 6 follows directly the existence of an a_i ∈ A_i such that Q(a_i) ≥ Q(r).
The desired result then follows from the fact that a_i ≠ a_j whenever i ≠ j.

We are now ready to prove Proposition 2 and Theorem 3 (which we re-state here for convenience).
Theorem 3. Given a probability measure P over a hypothesis space F with finite Radon number r, let F denote a random variable with distribution P. Denote by r_1 the random variable obtained by computing the Radon point of r random points drawn iid according to P and by P_1 its distribution. For any h ∈ ℕ, let r_h denote the Radon point of r random points drawn iid from P_{h−1} and denote by P_h its distribution. Then for any convex function Q : F → ℝ and all ε ∈ ℝ it holds that

P(Q(r_h) > ε) ≤ (r P(Q(F) > ε))^{2^h}.

Proof of Proposition 2 and Theorem 3. By Proposition 7, for any Radon point r of a set S there must be two points a, b ∈ S with Q(a), Q(b) ≥ Q(r). Hence, the probability of Q(r) > ε is less than or equal to the probability that a pair a, b ∈ S has Q(a), Q(b) > ε. Proposition 2 follows by an application of the union bound on all pairs from S. Repeated application of the proposition proves Theorem 3.

Acknowledgements

Part of this work was conducted while Mario Boley, Olana Missura, and Thomas Gärtner were at the University of Bonn and partially funded by the German Science Foundation (DFG, under ref. GA 1615/1-1 and GA 1615/2-1). The authors would like to thank Dino Oglic, Graham Hutton, Roderick MacKenzie, and Stefan Wrobel for valuable discussions and comments.

References

[1] Sanjeev Arora and Boaz Barak. Computational complexity: A modern approach. Cambridge University Press, 2009.

[2] Maria Florina Balcan, Yingyu Liang, Le Song, David Woodruff, and Bo Xie.
Communication-efficient distributed kernel principal component analysis. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 725–734, 2016.

[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.

[4] Manuel Blum. A machine-independent theory of the complexity of recursive functions. Journal of the ACM (JACM), 14(2):322–336, 1967.

[5] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

[6] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.

[7] Ashok K. Chandra and Larry J. Stockmeyer. Alternation. In 17th Annual Symposium on Foundations of Computer Science, pages 98–108, 1976.

[8] Kenneth L. Clarkson, David Eppstein, Gary L. Miller, Carl Sturtivant, and Shang-Hua Teng. Approximating center points with iterative Radon points. International Journal of Computational Geometry & Applications, 6(3):357–377, 1996.

[9] Stephen A. Cook. Deterministic CFL's are accepted simultaneously in polynomial time and log squared space. In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, pages 338–345, 1979.

[10] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1):165–202, 2012.

[11] Shai Fine and Katya Scheinberg.
Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2002.

[12] Yoav Freund, Yishay Mansour, and Robert E. Schapire. Why averaging classifiers can protect against overfitting. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics, 2001.

[13] Raymond Greenlaw, H. James Hoover, and Walter L. Ruzzo. Limits to parallel computation: P-completeness theory. Oxford University Press, Inc., 1995.

[14] Steve Hanneke. The optimal sample complexity of PAC learning. Journal of Machine Learning Research, 17(38):1–15, 2016.

[15] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.

[16] David Kay and Eugene W. Womble. Axiomatic convexity theory and relationships between the Carathéodory, Helly, and Radon numbers. Pacific Journal of Mathematics, 38(2):471–485, 1971.

[17] Clyde P. Kruskal, Larry Rudolph, and Marc Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71(1):95–132, 1990.

[18] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to parallel computing: design and analysis of algorithms. Benjamin-Cummings Publishing Co., Inc., 1994.

[19] Moshe Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[20] Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. Journal of Machine Learning Research, 18(92):1–31, 2017. URL http://jmlr.org/papers/v18/15-586.html.

[21] Philip M. Long and Rocco A. Servedio. Algorithms and hardness results for parallel large margin learning. Journal of Machine Learning Research, 14:3105–3128, 2013.

[22] Chenxin Ma, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael I.
Jordan, Peter Richtárik, and Martin Takáč. Distributed optimization with arbitrary local solvers. Optimization Methods and Software, 32(4):813–848, 2017.

[23] Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S. Mann. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pages 1231–1239, 2009.

[24] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.

[25] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1–7, 2016.

[26] Cleve Moler. Matrix computation on distributed memory multiprocessors. Hypercube Multiprocessors, 86(181-195):31, 1986.

[27] Ilia Nouretdinov, Sergi G. Costafreda, Alexander Gammerman, Alexey Chervonenkis, Vladimir Vovk, Vladimir Vapnik, and Cynthia H.Y. Fu. Machine learning classification with confidence: application of transductive conformal predictors to MRI-based diagnostic and prognostic markers in depression. Neuroimage, 56(2):809–813, 2011.

[28] Dino Oglic and Thomas Gärtner.
Nyström method with kernel k-means++ samples as landmarks. In Proceedings of the 34th International Conference on Machine Learning, pages 2652–2660, 2017.

[29] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[30] Johann Radon. Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten. Mathematische Annalen, 83(1):113–115, 1921.

[31] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

[32] Jonathan D. Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference, 5(4):379–404, 2016.

[33] Alexander M. Rubinov. Abstract convexity and global optimization, volume 44. Springer Science & Business Media, 2013.

[34] Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, pages 850–857, 2014.

[35] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.

[36] Robin Sommer and Vern Paxson. Outside the closed world: On using machine learning for network intrusion detection. In Symposium on Security and Privacy, pages 305–316, 2010.

[37] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright.
Optimization for machine learning. MIT Press, 2012.

[38] John W. Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, volume 2, pages 523–531, 1975.

[39] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[40] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.

[41] Vladimir N. Vapnik and Alexey Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.

[42] Jeffrey S. Vitter and Jyh-Han Lin. Learning in parallel. Information and Computation, 96(2):179–202, 1992.

[43] Ulrike von Luxburg and Bernhard Schölkopf. Statistical learning theory: models, concepts, and results. In Inductive Logic, volume 10 of Handbook of the History of Logic, pages 651–706. Elsevier, 2011.

[44] Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. Data Mining: Practical machine learning tools and techniques. Elsevier, 2017.

[45] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(1):3321–3363, 2013.

[46] Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. Parallelized stochastic gradient descent.
In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.