{"title": "GENO -- GENeric Optimization for Classical Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2190, "page_last": 2201, "abstract": "Although optimization is the longstanding, algorithmic backbone of machine learning new models still require the time-consuming implementation of new solvers. As a result, there are thousands of implementations of optimization algorithms for machine learning problems. A natural question is, if it is always necessary to implement a new solver, or is there one algorithm that is sufficient for most models. Common belief suggests that such a one-algorithm-fits-all approach cannot work, because this algorithm cannot exploit model specific structure. At least, a generic algorithm cannot be efficient and robust on a wide variety of problems. Here, we challenge this common belief. We have designed and implemented the optimization framework GENO (GENeric Optimization) that combines a modeling language with a generic solver. GENO takes the declaration of an optimization problem and generates a solver for the specified problem class. The framework is flexible enough to encompass most of the classical machine learning problems. 
We show on a wide variety of classical but also some recently suggested problems that the automatically generated solvers are (1) as efficient as well-engineered, specialized solvers, (2) more efficient by a decent margin than recent state-of-the-art solvers, and (3) orders of magnitude more efficient than classical modeling language plus solver approaches.", "full_text": "GENO – GENeric Optimization for Classical Machine Learning

Sören Laue
Friedrich-Schiller-Universität Jena &
Data Assessment Solutions GmbH
soeren.laue@uni-jena.de

Matthias Mitterreiter
Friedrich-Schiller-Universität Jena
Germany
matthias.mitterreiter@uni-jena.de

Joachim Giesen
Friedrich-Schiller-Universität Jena
Germany
joachim.giesen@uni-jena.de

Abstract

Although optimization is the longstanding algorithmic backbone of machine learning, new models still require the time-consuming implementation of new solvers. As a result, there are thousands of implementations of optimization algorithms for machine learning problems. A natural question is whether it is always necessary to implement a new solver, or whether there is one algorithm that is sufficient for most models. Common belief suggests that such a one-algorithm-fits-all approach cannot work, because this algorithm cannot exploit model-specific structure and thus cannot be efficient and robust on a wide variety of problems. Here, we challenge this common belief. We have designed and implemented the optimization framework GENO (GENeric Optimization) that combines a modeling language with a generic solver. GENO generates a solver from the declarative specification of an optimization problem class. The framework is flexible enough to encompass most of the classical machine learning problems.
We show on a wide variety of classical but also some recently suggested problems that the automatically generated solvers are (1) as efficient as well-engineered specialized solvers, (2) more efficient by a decent margin than recent state-of-the-art solvers, and (3) orders of magnitude more efficient than classical modeling language plus solver approaches.

1 Introduction

Optimization is at the core of machine learning and many other fields of applied research, for instance operations research, optimal control, and deep learning. The latter fields have embraced frameworks that combine a modeling language with only a few optimization solvers: interior point solvers in operations research, and stochastic gradient descent (SGD) and variants thereof in deep learning frameworks like TensorFlow, PyTorch, or Caffe. That is in stark contrast to classical (i.e., non-deep) machine learning, where new problems are often accompanied by new optimization algorithms and their implementation. However, designing and implementing optimization algorithms is still a time-consuming and error-prone task.
The lack of an optimization framework for classical machine learning problems can be explained partially by the common belief that any efficient solver needs to exploit problem-specific structure. Here, we challenge this common belief.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

We introduce GENO (GENeric Optimization), an optimization framework that allows the user to state optimization problems in an easy-to-read modeling language. From the specification, an optimizer is automatically generated by using automatic differentiation on a symbolic level.
The optimizer combines a quasi-Newton solver with an augmented Lagrangian approach for handling constraints. Any generic modeling language plus solver approach frees the user from tedious implementation aspects and allows them to focus on the modeling aspects of the problem at hand. However, it is required that the solver is efficient and accurate. Contrary to common belief, we show here that the solvers generated by GENO are (1) as efficient as well-engineered, specialized solvers at the same or better accuracy, (2) more efficient by a decent margin than recent state-of-the-art solvers, and (3) orders of magnitude more efficient than classical modeling language plus solver approaches.

Related work. Classical machine learning is typically served by toolboxes like scikit-learn [48], Weka [23], and MLlib [40]. These toolboxes mainly serve as wrappers for a collection of well-engineered implementations of standard solvers like LIBSVM [11] for support vector machines or glmnet [24] for generalized linear models. A disadvantage of the toolbox approach is a lack of flexibility. Even a slightly changed model, for instance one with an added non-negativity constraint, might already be missing from the framework.
Modeling languages provide more flexibility since they allow the user to specify problems from large problem classes. Popular modeling languages for optimization are CVX [14, 29] for MATLAB and its Python extension CVXPY [3, 17], and JuMP [20], which is bound to Julia. In the operations research community, AMPL [22] and GAMS [9] have been used for many years. All these languages take an instance of an optimization problem and transform it into some standard form of a linear program (LP), quadratic program (QP), second-order cone program (SOCP), or semi-definite program (SDP). The transformed problem is then addressed by solvers for the corresponding standard form.
However, the transformation into standard form can be inefficient, because the formal representation in standard form can grow substantially with the problem size. This representational inefficiency directly translates into computational inefficiency.
The modeling language plus solver paradigm has been made deployable in the CVXGEN [39], QPgen [26], and OSQP [4] projects. In these projects, code is generated for the specified problem class. However, the problem dimension and sometimes the underlying sparsity pattern of the data need to be fixed. Thus, the size of the generated code still grows with a growing problem dimension. All these projects are targeted at embedded systems and are optimized for small or sparse problems. The underlying solvers are based on Newton-type methods that solve a Newton system of equations by direct methods. Solving these systems is efficient only for small problems or problems where the sparsity structure of the Hessian can be exploited in the Cholesky factorization. Neither condition is typically met in standard machine learning problems.
Deep learning frameworks like TensorFlow [1], PyTorch [47], or Caffe [33] are efficient and fairly flexible. However, they target only deep learning problems, that is, typically unconstrained problems that ask to optimize a separable sum of loss functions. Algorithmically, deep learning frameworks usually employ some form of stochastic gradient descent (SGD) [51], the rationale being that computing the full gradient is too slow and actually not necessary. A drawback of SGD-type algorithms is that they need careful parameter tuning of, for instance, the learning rate or, for accelerated SGD, the momentum. Parameter tuning is a time-consuming and often data-dependent task.
A careless choice of these parameters can make the algorithm slow or even cause it to diverge. Also, SGD-type algorithms cannot handle constraints.
GENO, the framework that we present here, differs from the standard modeling language plus solver approach by a much tighter coupling of the language and the solver. GENO does not transform problem instances but whole problem classes, including constrained problems, into a very general standard form. Since the standard form is independent of any specific problem instance, it does not grow for larger instances. GENO does not require the user to tune parameters, and the generated code is highly efficient.

Table 1: Comparison of approaches/frameworks for optimization in machine learning.

                                      handwritten  TensorFlow,  Weka,         CVXPY  GENO
                                      solver       PyTorch      Scikit-learn
flexible                              ✗            ✓            ✗             ✓      ✓
efficient                             ✓            ✓            ✓             ✗      ✓
deployable / stand-alone              ✓            ✗            ✗             ✗      ✓
can accommodate constraints           ✓            ✗            ✓             ✓      ✓
parameter free (learning rate, ...)   ✗/✓          ✗            ✓             ✓      ✓
allows non-convex problems            ✓            ✓            ✓             ✗      ✓

2 The GENO Pipeline

GENO features a modeling language and a solver that are tightly coupled. The modeling language allows the user to specify a whole class of optimization problems in terms of an objective function and constraints that are given as vectorized linear algebra expressions. Neither the objective function nor the constraints need to be differentiable. Non-differentiable problems are transformed into constrained, differentiable problems. A general purpose solver for constrained, differentiable problems is then instantiated with the objective function, the constraint functions, and their respective gradients. The gradients are computed by the matrix and tensor calculus algorithm [36] and its extension [37].
The tight integration of the modeling language and the solver is possible only because of this recent progress in computing derivatives of vectorized linear algebra expressions.
Generating a solver takes only a few milliseconds. Once it has been generated, the solver can be used like any hand-written solver for every instance of the specified problem class. An interface to the GENO framework can be found at http://www.geno-project.org.

2.1 Modeling Language

A GENO specification has four blocks (cf. the example below that shows an ℓ1-norm minimization problem from compressed sensing where the signal is known to be an element of the unit simplex): (1) declaration of the problem parameters that can be of type Matrix, Vector, or Scalar, (2) declaration of the optimization variables that also can be of type Matrix, Vector, or Scalar, (3) specification of the objective function in a MATLAB-like syntax, and finally (4) specification of the constraints, also in a MATLAB-like syntax that supports the following operators and functions: +, -, *, /, .*, ./, ^, .^, log, exp, sin, cos, tanh, abs, norm1, norm2, sum, tr, det, inv. The set of operators and functions can be expanded when needed.

parameters
  Matrix A
  Vector b
variables
  Vector x
min
  norm1(x)
st
  A*x == b
  sum(x) == 1
  x >= 0

Note that in contrast to instance-based modeling languages like CVXPY, no dimensions have to be specified. Also, the specified problems do not need to be convex. In the non-convex case, only a locally optimal solution will be computed.

2.2 Generic Optimizer

At its core, GENO's generic optimizer is a solver for unconstrained, smooth optimization problems. This solver is then extended to handle also non-smooth and constrained problems.
In the following we first describe the smooth, unconstrained solver before we detail how it is extended to handle non-smooth and constrained optimization problems.

Solver for unconstrained, smooth problems. There exist quite a number of algorithms for unconstrained optimization. Since in our approach we target problems with a few dozen up to a few million variables, we decided to build on a first-order method. This still leaves many options. Nesterov's method [44] has an optimal theoretical running time, that is, its asymptotic running time matches the lower bounds in Ω(1/√ε) in the smooth, convex case and Ω(log(1/ε)) in the strongly convex case with optimal dependence on the Lipschitz constants L and µ that have to be known in advance. Here L and µ are upper and lower bounds, respectively, on the eigenvalues of the Hessian. On quadratic problems, quasi-Newton methods share the same optimal convergence guarantee [32, 43] without requiring the values for these parameters. In practice, quasi-Newton methods often outperform Nesterov's method, although they cannot beat it in the worst case. It is important to keep in mind that theoretical running time guarantees do not always translate into good performance in practice. For instance, even the simple subgradient method has been shown to have a convergence guarantee in O(log(1/ε)) on strongly convex problems [28], but it is certainly not competitive in general.
Hence, we settled on a quasi-Newton method and implemented the well-established L-BFGS-B algorithm [10, 55] that can also handle box constraints on the variables. It serves as the solver for unconstrained, smooth problems. The algorithm combines the standard limited memory quasi-Newton method with a projected gradient path approach.
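This solver class can be illustrated with SciPy's independent L-BFGS-B implementation (a minimal sketch; the toy function and bounds are our own example, not GENO's generated code):

```python
import numpy as np
from scipy.optimize import minimize

# Smooth objective with gradient; the unconstrained minimizer x = 3 lies
# outside the box [0, 2], so L-BFGS-B should stop at the boundary x = 2.
def f(x):
    return float(np.sum((x - 3.0) ** 2))

def grad(x):
    return 2.0 * (x - 3.0)

x0 = np.zeros(4)
res = minimize(f, x0, jac=grad, method="L-BFGS-B", bounds=[(0.0, 2.0)] * 4)
print(res.x)  # each coordinate close to 2.0
```

The box constraints are handled inside the solver by the gradient projection just described; no learning rate needs to be chosen.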
In each iteration, the gradient path is projected onto the box constraints, and the quadratic function based on the second-order approximation (L-BFGS) of the Hessian is minimized along this path. All variables that are at their boundaries are fixed and only the remaining free variables are optimized using the second-order approximation. Any solution that is not within the bound constraints is projected back onto the feasible set by a simple min/max operation [41]. Only in rare cases does a projected point not form a descent direction. In this case, instead of using the projected point, one picks the best point that is still feasible along the ray towards the solution of the quadratic approximation. Then, a line search is performed for satisfying the strong Wolfe conditions [53, 54]. This ensures convergence also in the non-convex case. The line search also removes the need for a step length or learning rate that is usually necessary in SGD, subgradient algorithms, or Nesterov's method. Here, we use the line search proposed in [42], which we enhanced by a backtracking line search in case the solver enters a region where the function is not defined.

Solver for unconstrained non-smooth problems. Machine learning often entails non-smooth optimization problems, for instance all problems that employ ℓ1-regularization. Proximal gradient methods are a general technique for addressing such problems [49]. Here, we pursue a different approach. All non-smooth convex optimization problems that are allowed by our modeling language can be written as min_x {max_i f_i(x)} with smooth functions f_i(x) [45]. This class is flexible enough to accommodate most of the non-smooth objective functions encountered in machine learning. All problems in this class can be transformed into constrained, smooth problems of the form

min_{t,x}  t    s. t. 
f_i(x) ≤ t  for all i.

The transformed problems can then be solved by the solver for constrained, smooth optimization problems that we describe next.

Solver for smooth constrained problems. There are also quite a few options for solving smooth, constrained problems, among them projected gradient methods, the alternating direction method of multipliers (ADMM) [8, 25, 27], and the augmented Lagrangian approach [30, 50]. For GENO, we decided to follow the augmented Lagrangian approach, because this allows us to (re-)use our solver for smooth, unconstrained problems directly. Also, the augmented Lagrangian approach is more generic than ADMM. All ADMM-type methods need a proximal operator that cannot be derived automatically from the problem specification, and a closed-form solution is sometimes not easy to compute. Typically, one uses standard duality theory for deriving the prox-operator. In [49], prox-operators are tabulated for several functions.
The augmented Lagrangian method can be used for solving the following general standard form of an abstract constrained optimization problem

min_x  f(x)
s. t.  h(x) = 0                                    (1)
       g(x) ≤ 0,

where x ∈ R^n, f : R^n → R, h : R^n → R^m, g : R^n → R^p are differentiable functions, and the equality and inequality constraints are understood component-wise.
The augmented Lagrangian of Problem (1) is the following function

L_ρ(x, λ, µ) = f(x) + (ρ/2) ‖h(x) + λ/ρ‖² + (ρ/2) ‖(g(x) + µ/ρ)_+‖²,

where λ ∈ R^m and µ ∈ R^p_{≥0} are Lagrange multipliers, ρ > 0 is a constant, ‖·‖ denotes the Euclidean norm, and (v)_+ denotes max{v, 0}.
The Lagrange multipliers are also referred to as dual variables. In principle, the augmented Lagrangian is the standard Lagrangian of Problem (1) augmented with a quadratic penalty term. This term provides increased stability during the optimization process, which can be seen for example in the case that Problem (1) is a linear program.
The Augmented Lagrangian Algorithm 1 runs in iterations. In each iteration it solves an unconstrained smooth optimization problem. Upon convergence, it will return an approximate solution x to the original problem along with an approximate solution of the Lagrange multipliers for the dual problem. If Problem (1) is convex, then the algorithm returns the globally optimal solution. Otherwise, it returns a local optimum [5]. The update of the penalty parameter ρ can be ignored and the algorithm still converges [5]. However, in practice it is beneficial to increase it depending on the progress in satisfying the constraints [6]. If the infinity norm of the constraint violation decreases by a factor less than τ = 1/2 in one iteration, then ρ is multiplied by a factor of two.

Algorithm 1 Augmented Lagrangian Algorithm
 1: input: instance of Problem (1)
 2: output: approximate solution x ∈ R^n, λ ∈ R^m, µ ∈ R^p_{≥0}
 3: initialize x_0 = 0, λ_0 = 0, µ_0 = 0, and ρ = 1
 4: repeat
 5:   x_{k+1} := argmin_x L_ρ(x, λ_k, µ_k)
 6:   λ_{k+1} := λ_k + ρ h(x_{k+1})
 7:   µ_{k+1} := (µ_k + ρ g(x_{k+1}))_+
 8:   update ρ
 9: until convergence
10: return x_k, λ_k, µ_k

3 Limitations

While GENO is very general and efficient, it also has some limitations that we discuss here.
For small problems, i.e., problems with only a few dozen variables, Newton-type methods with a direct solver for the Newton system can be even faster. GENO also does not target deep learning applications, where gradients do not need to be computed fully but can be sampled.
Some problems can pose numerical difficulties; for instance, problems containing an exp operator might cause an overflow/underflow. However, this is a problem that is faced by all frameworks. It is usually addressed by introducing special operators like logsumexp.
Furthermore, GENO does not perform sanity checks on the provided input. Any syntactically correct problem specification is accepted by GENO as a valid input. For example, log(det(xxᵀ)), where x is a vector, is a valid expression. But the determinant of the outer product will always be zero and hence, taking the logarithm will fail. It is the responsibility of the user to make sure that expressions are mathematically valid.

4 Experiments

We conducted a number of experiments to show the wide applicability and efficiency of our approach. For the experiments, we have chosen classical problems that come with established, well-engineered solvers, like logistic regression or elastic net regression, but also problems and algorithms that have been published at NeurIPS and ICML only within the last few years. The experiments cover smooth unconstrained problems as well as constrained and non-smooth problems. To prevent a bias towards GENO, we always used the original code for the competing methods and followed the experimental setup in the papers where these methods have been introduced.
We ran the experiments on standard data sets from the LIBSVM data set repository, and, in some cases, on synthetic data sets on which competing methods had been evaluated in the corresponding papers.
Specifically, our experiments cover the following problems and solvers: ℓ1- and ℓ2-regularized logistic regression, support vector machines, elastic net regression, non-negative least squares, symmetric non-negative matrix factorization, problems from non-convex optimization, and compressed sensing. Among other algorithms, we compared against a trust-region Newton method with conjugate gradient descent for solving the Newton system, sequential minimal optimization (SMO), dual coordinate descent, proximal methods including ADMM and variants thereof, interior point methods, accelerated and variance-reduced variants of SGD, and Nesterov's optimal gradient descent.
Our test machine was equipped with an eight-core Intel Xeon CPU E5-2643 and 256GB RAM. We used Python 3.6, along with NumPy 1.16, SciPy 1.2, and scikit-learn 0.20. In some cases, the original code of the competing methods was written and run in MATLAB R2019.

Figure 1: The regularization path of ℓ1-regularized logistic regression for the Iris data set using SAGA, GENO, CVXPY, and LIBLINEAR. The four coefficients of each model are plotted as a regularization path from strong regularization, where all coefficients are 0, to looser regularization, where coefficients can attain non-zero values.

4.1 Regularization Path for ℓ1-regularized Logistic Regression

Logistic regression is probably the most popular linear, binary classification method.
It is given by the following unconstrained optimization problem

min_w  λ · r(w) + (1/m) Σ_i log(exp(−y_i X_i w) + 1),

where X ∈ R^{m×n} is a data matrix, y ∈ {−1, +1}^m is a label vector, r : R^n → R is the regularizer, and λ ∈ R is the regularization parameter. The regularizer r is usually chosen to be the ℓ1-norm or the ℓ2-norm.
Computing the regularization path of the ℓ1-regularized logistic regression problem [13] is a classical machine learning problem, and only boring at first glance. The problem is well suited for demonstrating the importance of both aspects of our approach, namely flexibility and efficiency. As a standard problem it is covered in scikit-learn. The scikit-learn implementation features the SAGA algorithm [16] for computing the whole regularization path that is shown in Figure 1. This figure can also be found on the scikit-learn website.¹ However, when using GENO, the regularization path looks different, see also Figure 1. Checking the objective function values reveals that the precision of the SAGA algorithm is not enough for tracking the path faithfully. GENO's result can be reproduced by using CVXPY, except for one outlier at which CVXPY did not compute the optimal solution. LIBLINEAR [21, 57] can also be used for computing the regularization path, but also fails to follow the exact path. This can be explained as follows: LIBLINEAR also does not compute optimal solutions, but more importantly, in contrast to the original formulation, it penalizes the bias for algorithmic reasons. Thus, changing the problem slightly can lead to fairly different results.

¹ https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_path.html

CVXPY, like GENO, is flexible and precise enough to accommodate the original problem formulation and to closely track the regularization path.
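For concreteness, the objective above can be evaluated directly; the following is a hedged NumPy sketch on synthetic data (all names are illustrative, not GENO's generated code), using logaddexp for numerical stability:

```python
import numpy as np

# l1-regularized logistic regression objective:
#   lam * ||w||_1 + (1/m) * sum_i log(exp(-y_i * X_i w) + 1)
rng = np.random.default_rng(0)
m, n = 100, 4
X = rng.standard_normal((m, n))
y = np.sign(rng.standard_normal(m))   # labels in {-1, +1}
y[y == 0] = 1.0

def objective(w, lam):
    margins = -y * (X @ w)
    # log(exp(z) + 1) = logaddexp(z, 0), computed without overflow
    loss = np.mean(np.logaddexp(margins, 0.0))
    return lam * np.sum(np.abs(w)) + loss

w0 = np.zeros(n)
print(objective(w0, lam=1e-4))  # log(2) ≈ 0.693, since all margins are 0 at w = 0
```

Differences between solvers on the regularization path come down to how accurately this objective is minimized for each value of λ.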
But it is not as efficient as GENO. On the problem used in Figure 1, SAGA takes 4.3 seconds, the GENO solver takes 0.5 seconds, CVXPY takes 13.5 seconds, and LIBLINEAR takes 0.05 seconds, but for a slightly different problem and at insufficient accuracy.

4.2 ℓ2-regularized Logistic Regression

Since it is a classical problem, there exist many well-engineered solvers for ℓ2-regularized logistic regression. The problem also serves as a testbed for new algorithms. We compared GENO to the parallel version of LIBLINEAR and a number of recently developed algorithms and their implementations, namely Point-SAGA [15], SDCA [52], and catalyst SDCA [38]. The latter algorithms implement some form of SGD. Thus their running time heavily depends on the values for the learning rate (step size) and, in the case of accelerated SGD, the momentum parameter. The best parameter setting often depends on the regularization parameter and the data set. We have used the code provided by [15] and the parameter settings therein.
For our experiments, we set the regularization parameter λ = 10⁻⁴ and used real-world data sets that are commonly used in experiments involving logistic regression. GENO converges almost as rapidly as LIBLINEAR and outperforms any of the recently published solvers by a good margin, see Figure 2.

Figure 2: Running times for different solvers on the ℓ2-regularized logistic regression problem.

On substantially smaller data sets, we also compared GENO to CVXPY with both the ECOS [19] and the SCS solver [46]. As can be seen from Table 2, GENO is orders of magnitude faster.

Table 2: Running times in seconds for different general purpose solvers on small instances of the ℓ2-regularized logistic regression problem.
The approximation error is close to 10⁻⁶ for all solvers.

Solver   heart   ionosphere   breast-cancer   australian   diabetes   a1a      a5a
GENO     0.005   0.013        0.004           0.014        0.006      0.023    0.062
ECOS     1.999   2.775        5.080           5.380        5.881      12.606   57.467
SCS      2.589   3.330        6.224           6.578        6.743      16.361   87.904

[Figure 2 plots: optimality gap versus time in seconds on the data sets a9a, covtype.binary, mushrooms, rcv1_test.binary, real-sim, and webspam for Point-SAGA, SAGA, CSDCA, Liblinear, and GENO.]

4.3 Symmetric Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) and its many variants are standard methods for recommender systems [2] and topic modeling [7, 31]. The problem is known as symmetric NMF when both factor matrices are required to be identical. Symmetric NMF is used for clustering problems [35] and known to be equivalent to k-means kernel clustering [18]. Given a target matrix T ∈ R^{n×n}, symmetric NMF is given as the following optimization problem

min_U  ‖T − U Uᵀ‖²_Fro    s. t.  U ≥ 0,

where U ∈ R^{n×k} is a non-negative factor matrix of rank k. Note that the problem cannot be modeled and solved by CVXPY since it is non-convex. It has been addressed recently in [56] by two new methods. Both methods are symmetric variants of the alternating non-negative least squares (ANLS) [34] and the hierarchical ALS (HALS) [12] algorithms.
We compared GENO to both methods. For the comparison, we used the code and the same experimental setup as in [56].
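A minimal projected-gradient sketch of the symmetric NMF objective (illustrative only; this is neither SymANLS/SymHALS nor GENO's generated solver, and the step size is a hand-picked constant):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
U_true = np.abs(rng.standard_normal((n, k)))
T = U_true @ U_true.T                    # positive semidefinite rank-k target

def objective(U):
    R = T - U @ U.T
    return float(np.sum(R * R))          # ||T - U U^T||_Fro^2

U = np.abs(rng.standard_normal((n, k)))  # random non-negative start
obj0 = objective(U)
step = 2e-4
for _ in range(3000):
    G = -4.0 * (T - U @ U.T) @ U         # gradient of the objective in U
    U = np.maximum(U - step * G, 0.0)    # project onto the constraint U >= 0
print(obj0, objective(U))  # the objective decreases substantially
```

Because the problem is non-convex, such a sketch only finds a stationary point; the methods compared below are considerably more refined.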
Random positive-semidefinite target matrices T = Û Ûᵀ of different sizes were computed from random matrices Û ∈ R^{n×k} with absolute-value Gaussian entries. As can be seen in Figure 3, GENO outperforms both methods (SymANLS and SymHALS) by a large margin.

Figure 3: Convergence speed on the symmetric non-negative matrix factorization problem for different parameter values. On the left, the times for m = 50, k = 5, in the middle for m = 500, k = 10, and on the right for m = 2000, k = 15.

4.4 Further Experiments

Further experiments on support vector machines, elastic net regression, non-negative least squares, problems from non-convex optimization, and compressed sensing, along with the GENO models for all experiments, can be found in the supplemental material.

5 Conclusions

While other fields of applied research that heavily rely on optimization, like operations research, optimal control, and deep learning, have adopted optimization frameworks, this is not the case for classical machine learning. Instead, classical machine learning methods are still mostly accessed through toolboxes like scikit-learn, Weka, or MLlib. These toolboxes provide well-engineered solutions for many of the standard problems, but lack the flexibility to adapt the underlying models when necessary. We attribute this state of affairs to a common belief that efficient optimization for classical machine learning needs to exploit the problem structure. Here, we have challenged this belief. We have presented GENO, the first general purpose framework for problems from classical machine learning. Using recent results in automatic differentiation, GENO combines an easy-to-read modeling language with a general purpose solver.
Experiments on a variety of problems from classical machine learning demonstrate that GENO is as efficient as established, well-engineered solvers and often outperforms recently published state-of-the-art solvers by a good margin. It is as flexible as state-of-the-art modeling language and solver frameworks, but outperforms them by a few orders of magnitude.

[Figure 3 plots: optimality gap versus time in seconds for SymANLS, SymHALS, and GENO.]

Acknowledgments

Sören Laue has been funded by Deutsche Forschungsgemeinschaft (DFG) under grant LA 2971/1-1.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In USENIX Conference on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.

[2] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge & Data Engineering, (6):734–749, 2005.

[3] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42–60, 2018.

[4] Goran Banjac, Bartolomeo Stellato, Nicholas Moehle, Paul Goulart, Alberto Bemporad, and Stephen P. Boyd. Embedded code generation using the OSQP solver. In Conference on Decision and Control (CDC), pages 1906–1911, 2017.

[5] Dimitri P. Bertsekas. Nonlinear Programming. 
Athena Scientific, Belmont, MA, 1999.

[6] Ernesto G. Birgin and José Mario Martínez. Practical augmented Lagrangian methods for constrained optimization, volume 10 of Fundamentals of Algorithms. SIAM, 2014.

[7] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[8] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[9] A. Brooke, D. Kendrick, and A. Meeraus. GAMS: release 2.25: a user's guide. The Scientific Press Series. Scientific Press, 1992.

[10] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM J. Scientific Computing, 16(5):1190–1208, 1995.

[11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[12] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 92(3):708–721, 2009.

[13] David R. Cox. The regression analysis of binary sequences (with discussion). J. Roy. Stat. Soc. B, 20:215–242, 1958.

[14] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, December 2018.

[15] Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems (NIPS), pages 676–684, 2016.

[16] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), pages 1646–1654, 2014.

[17] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[18] Chris Ding, Xiaofeng He, and Horst D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In SIAM International Conference on Data Mining (SDM), pages 606–610. SIAM, 2005.

[19] Alexander Domahidi, Eric Chu, and Stephen P. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), 2013.

[20] Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.

[21] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[22] Robert Fourer, David M. Gay, and Brian W. Kernighan. AMPL: a modeling language for mathematical programming. Thomson/Brooks/Cole, 2003.

[23] Eibe Frank, Mark A. Hall, and Ian H. Witten. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques". Morgan Kaufmann, fourth edition, 2016.

[24] Jerome H. Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[25] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

[26] P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas-Rachford splitting and ADMM.
IEEE Transactions on Automatic Control, 62(2):532–544, Feb 2017.

[27] R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique, 9(R2):41–76, 1975.

[28] Jean-Louis Goffin. On convergence rates of subgradient optimization methods. Math. Program., 13(1):329–347, 1977.

[29] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. 2008.

[30] Magnus R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.

[31] Thomas Hofmann. Probabilistic latent semantic analysis. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 289–296, 1999.

[32] Ho-Yi Huang. Unified approach to quadratically convergent algorithms for function minimization. Journal of Optimization Theory and Applications, 5(6):405–423, 1970.

[33] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[34] Jingu Kim and Haesun Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In IEEE International Conference on Data Mining (ICDM), pages 353–362, 2008.

[35] Da Kuang, Sangwoon Yun, and Haesun Park. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering.
Journal of Global Optimization, 62(3):545–574, 2015.

[36] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[37] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. A simple and efficient tensor calculus. In Conference on Artificial Intelligence (AAAI), 2020. To appear.

[38] Hongzhou Lin, Julien Mairal, and Zaïd Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), pages 3384–3392, 2015.

[39] Jacob Mattingley and Stephen Boyd. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, 2012.

[40] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(1), January 2016.

[41] José Luis Morales and Jorge Nocedal. Remark on "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization". ACM Trans. Math. Softw., 38(1):7:1–7:4, 2011.

[42] Jorge J. Moré and David J. Thuente. Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw., 20(3):286–307, 1994.

[43] L. Nazareth. A relationship between the BFGS and conjugate gradient algorithms and its implications for new algorithms. SIAM Journal on Numerical Analysis, 16(5):794–800, 1979.

[44] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN USSR (translated as Soviet Math. Docl.), 269, 1983.

[45] Yurii Nesterov. Smooth minimization of non-smooth functions.
Math. Program., 103(1):127–152, 2005.

[46] Brendan O'Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.

[47] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

[48] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[49] Nicholas G. Polson, James G. Scott, and Brandon T. Willard. Proximal algorithms in statistics and machine learning. arXiv preprint, May 2015.

[50] M. J. D. Powell. Algorithms for nonlinear constraints that use Lagrangian functions. Mathematical Programming, 14(1):224–248, 1969.

[51] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.

[52] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.

[53] P. Wolfe. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.

[54] P. Wolfe. Convergence conditions for ascent methods. II: Some corrections. SIAM Review, 13(2):185–188, 1971.

[55] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550–560, 1997.

[56] Zhihui Zhu, Xiao Li, Kai Liu, and Qiuwei Li.
Dropping symmetry for fast symmetric nonnegative matrix factorization. In Advances in Neural Information Processing Systems (NeurIPS), pages 5160–5170, 2018.

[57] Yong Zhuang, Yu-Chin Juan, Guo-Xun Yuan, and Chih-Jen Lin. Naive parallelization of coordinate descent methods and an application on multi-core l1-regularized classification. In International Conference on Information and Knowledge Management (CIKM), pages 1103–1112, 2018.