{"title": "Learning SMaLL Predictors", "book": "Advances in Neural Information Processing Systems", "page_first": 9125, "page_last": 9135, "abstract": "We introduce a new framework for learning in severely resource-constrained settings. Our technique delicately amalgamates the representational richness of multiple linear predictors with the sparsity of Boolean relaxations, and thereby yields classifiers that are compact, interpretable, and accurate. We provide a rigorous formalism of the learning problem, and establish fast convergence of the ensuing algorithm via relaxation to a minimax saddle point objective. We supplement the theoretical foundations of our work with an extensive empirical evaluation.", "full_text": "Learning SMaLL Predictors\n\nVikas K. Garg\nCSAIL, MIT\n\nvgarg@csail.mit.edu\n\nOfer Dekel\n\nMicrosoft Research\n\noferd@microsoft.com\n\nLin Xiao\n\nMicrosoft Research\n\nlin.xiao@microsoft.com\n\nAbstract\n\nWe introduce a new framework for learning in severely resource-constrained set-\ntings. Our technique delicately amalgamates the representational richness of multi-\nple linear predictors with the sparsity of Boolean relaxations, and thereby yields\nclassi\ufb01ers that are compact, interpretable, and accurate. We provide a rigorous\nformalism of the learning problem, and establish fast convergence of the ensuing\nalgorithm via relaxation to a minimax saddle point objective. We supplement the\ntheoretical foundations of our work with an extensive empirical evaluation.\n\n1\n\nIntroduction\n\nModern advances in machine learning have produced models that achieve unprecedented accuracy\non standard prediction tasks. However, this remarkable progress in model accuracy has come at a\nsigni\ufb01cant cost. Many state-of-the-art models have ballooned in size and applying them to a new\npoint can require tens of GFLOPs, which renders these methods ineffectual on resource-constrained\nplatforms like smart phones and wearables [1, 2]. 
Indeed, in these settings, inference with a compact\nlearner that can \ufb01t on the small device becomes an overarching determinant even if it comes at the\nexpense of slightly worse accuracy. Moreover, large models are often dif\ufb01cult to interpret, simply\nbecause humans are not good at reasoning about large, complex objects. Modern machine learning\nmodels are also more costly to train, but we sidestep that problem in this paper by assuming that we\ncan train our models on powerful servers in the cloud.\nIn our pursuit of compact and interpretable models, we take inspiration from the classic problem\nof learning disjunctive normal forms (DNFs) [3]. Speci\ufb01cally, a p-term k-DNF is a DNF with p\nterms, where each term contains exactly k Boolean variables. Small DNFs are a natural starting point\nfor our research, because they pack a powerful nonlinear descriptive capacity in a succinct form.\nThe DNF structure is also known to be intuitive and interpretable by humans [4, 5]. However, with\nthe exception of a few practical heuristics [4, 5, 6], an overwhelming body of work [7, 8, 9, 10, 11,\n12, 13, 14, 15, 16, 17] theoretically characterizes the dif\ufb01culty of learning a k-DNF under various\nrestricted models of learning. Our method, Sparse Multiprototype Linear Learner (SMaLL), bypasses\nthis issue by crafting a continuous relaxation that amounts to a form of improper learning of the\nk-DNFs in the sense that the hypothesis space subsumes p-term k-DNF classi\ufb01ers, and thus is at\nleast as powerful as the original k-DNF family. Armed with our technical paraphernalia, we design a\npractical algorithm that yields small and interpretable models.\nOur work may also be viewed as a delicate fusion of multiple prototypes [1, 18, 19, 20, 21, 22]\nwith Boolean relaxations [23]. 
The richness of models with multiple prototypes overcomes the representational limitations of sparse linear models like Lasso and Elastic-Net [24, 25, 26] that are typically not expressive enough to achieve state-of-the-art accuracy. Boolean relaxations afford us the ability to control the degree of sparsity explicitly in our predictors akin to exploiting an ℓ0 regularization, unlike the ℓ1-based methods that may require extensive tuning. Thus, our approach harnesses the best of both worlds. Moreover, folding sparsity into the training objective obviates the costs that would otherwise be incurred in compressing a large model via methods like pruning [27, 28, 29], low-rank approximation [30, 31], hashing [32], or parameter quantization [27, 33].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAdditionally, we overcome some significant limitations of other methods that use a small number of prototypes, such as [1, 19, 34]. These techniques invariably require solving highly non-convex or combinatorially hard mixed integer optimization problems, which makes it difficult to guarantee their convergence and optimality. We derive a minimax saddle-point relaxation that provably admits O(1/t) convergence via our customized Mirror-Prox algorithm. We provide detailed empirical results that demonstrate the benefits of our approach on a large variety of OpenML datasets. Specifically, on many of these datasets, our algorithm either surpasses the accuracy of the state-of-the-art baselines, or provides more compact models while remaining competitive in terms of accuracy.\nIn Section 2, we formulate the problem of learning a k-sparse p-prototype linear predictor as a mixed integer nonlinear optimization problem. Then, in Section 3, we relax this optimization problem to a saddle-point problem, which we solve using a Mirror-Prox algorithm. Finally, we present empirical results in Section 4. 
All the proofs are provided in the Supplementary to keep the exposition focused.\n\n2 Problem Formulation\nWe first derive a convex loss function for multiprototype binary classification. Let {(x_i, y_i)}_{i=1}^m be a training set of instance-label pairs, where each x_i ∈ ℝ^n and each y_i ∈ {−1, 1}. Let ℓ : ℝ → ℝ be a convex surrogate for the error indicator function 1_{f(x_i) ≠ y_i} = 1 if y_i f(x_i) < 0 and 0 otherwise. We also assume that ℓ upper bounds the error indicator function and is monotonically non-increasing. In particular, the popular hinge-loss and log-loss functions satisfy these properties.\nLet {w_j}_{j=1}^p be a set of linear prototypes. We consider a binary classifier of the form\n\nf(x) = sign( max_{j∈[p]} w_j · x ).\n\nOur decision rule is motivated by the following result.\nProposition 1. Consider the class C_k = {(w_1, w_2, ..., w_p) | ∀j ∈ [p], w_j ∈ ℝ^n, ‖w_j‖_0 = k} of p prototypes, where each prototype is k-sparse for k ≥ 0. For any x ∈ ℝ^n, let the predictors f = (w_1, w_2, ..., w_p) ∈ C_k take the following form:\n\nf(x) = 1 if max_{j∈[p]} w_j · x ≥ k, and −1 otherwise.\n\nLearning C_k amounts to improper learning of p-term k-DNF Boolean formulae.\n\nThus, our search space contains the family of k-DNF classifiers, though owing to the hardness of learning k-DNF, we may not always find a k-DNF classifier. Nonetheless, due to improper learning, the value of the objective returned will be a lower bound on the cost objective achieved by the space of p-term k-DNF classifiers (much like the relation between an integer program and its relaxation).\nWe handle the negative and positive examples separately. For each negative training example (x_i, −1), the classifier makes a correct prediction if and only if max_{j∈[p]} w_j · x_i < 0. 
Under our assumptions on ℓ, the error indicator function can be upper bounded as\n\n1_{f(x_i) ≠ −1} ≤ ℓ( −max_{j∈[p]} w_j · x_i ) = max_{j∈[p]} ℓ(−w_j · x_i),\n\nwhere the equality holds because we assume that ℓ is monotonically non-increasing. We note that the upper bound max_{j∈[p]} ℓ(−w_j · x_i) is jointly convex in {w_j}_{j=1}^p [35, Section 3.2.3].\nFor each positive example (x_i, +1), the classifier makes a correct prediction if and only if max_{j∈[p]} w_j · x_i > 0. By our assumptions on ℓ, we have\n\n1_{f(x_i) ≠ +1} ≤ ℓ( max_{j∈[p]} w_j · x_i ) = min_{j∈[p]} ℓ(w_j · x_i).    (1)\n\nAgain, the equality above is due to the monotonic non-increasing property of ℓ. Here the right-hand side min_{j∈[p]} ℓ(w_j · x_i) is not convex in {w_j}_{j=1}^p. We resolve this by designating a dedicated prototype w_{j(i)} for each positive training example (x_i, +1), and using the upper bound\n\n1_{f(x_i) ≠ +1} ≤ ℓ(w_{j(i)} · x_i).\n\nIn the extreme case, we can associate each positive example with a distinct prototype. Then there will be no loss of using ℓ(w_{j(i)} · x_i) compared with the upper bound in (1) when we set j(i) = arg max_{j∈[p]} w_j · x_i. However, in this case, the number of prototypes p is equal to the number of positive examples, which can be excessively large for storage and computation as well as cause overfitting. In practice, we may cluster the positive examples into p groups, where p is much smaller than the number of positive examples, and assign all positive examples in each group with a common prototype. In other words, we have j(i) = j(k) if x_i and x_k belong to the same cluster. 
This clustering step helps us provide a fast parametric alternative to the essentially non-parametric setting that assumes one prototype per positive example.\nOverall, we have the following convex surrogate for the total number of training errors:\n\nh(w_1, ..., w_p) = Σ_{i∈I_+} ℓ(w_{j(i)} · x_i) + Σ_{i∈I_−} max_{j∈[p]} ℓ(−w_j · x_i),    (2)\n\nwhere I_+ = {i : y_i = +1} and I_− = {i : y_i = −1}. In the rest of this paper, we let W ∈ ℝ^{p×n} be the matrix formed by stacking the vectors w_1^T, ..., w_p^T vertically, and denote the above loss function by h(W). In order to train a multi-prototype classifier, we minimize the regularized surrogate loss:\n\nmin_{W ∈ ℝ^{p×n}}  (1/m) h(W) + (λ/2) ‖W‖_F^2,    (3)\n\nwhere ‖·‖_F denotes the Frobenius norm of a matrix.\n\n2.1 Smoothing the Loss via Soft-Max\nIn this paper, we focus on the log-loss ℓ(z) = log(1 + exp(−z)). Although this ℓ is a smooth function, the overall loss h defined in (2) is non-smooth, due to the max operator in the sum over the set I_−. In order to take advantage of fast algorithms for smooth convex optimization, we smooth the loss function using soft-max. More specifically, we replace the non-smooth terms max_{j∈[p]} ℓ(t_j) in (2) with the soft-max operator over p items:\n\nu(t) ≜ log( 1 + Σ_{j∈[p]} exp(−t_j) ),    (4)\n\nwhere t = (t_1, ..., t_p) ∈ ℝ^p. Since the max term in (2) has t_j = −w_j · x_i, we obtain the smoothed loss function\n\nh̃(W) = Σ_{i∈I_+} ℓ(w_{j(i)} · x_i) + Σ_{i∈I_−} u(−W x_i),    (5)\n\naround which we will customize our algorithm design. 
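As a minimal sketch of the smoothing step, the NumPy snippet below evaluates the soft-max u(t) from (4) and the resulting smoothed loss. It takes t_j = −w_j · x_i for negative examples, which is the argument of ℓ inside the max term of (2). The function names and the assign array holding j(i) are illustrative assumptions, not part of the original text.

```python
import numpy as np


def softmax_u(t):
    # u(t) = log(1 + sum_j exp(-t_j)), computed stably as a log-sum-exp
    # over the values {0, -t_1, ..., -t_p}.
    z = np.concatenate(([0.0], -np.asarray(t, dtype=float)))
    zmax = z.max()
    return zmax + np.log(np.exp(z - zmax).sum())


def log_loss(z):
    # l(z) = log(1 + exp(-z)); logaddexp avoids overflow for large |z|.
    return np.logaddexp(0.0, -z)


def smoothed_loss(W, X, y, assign):
    # W: (p, n) prototypes; X: (m, n) instances; y: (m,) labels in {-1, +1}.
    # assign[i] = j(i), the prototype index of positive example i
    # (hypothetical input, e.g. from clustering; ignored for negatives).
    total = 0.0
    for xi, yi, ji in zip(X, y, assign):
        scores = W @ xi
        if yi > 0:
            total += log_loss(scores[ji])
        else:
            # t_j = -w_j . x_i, the argument of l in the max term of (2)
            total += softmax_u(-scores)
    return total
```

With all prototypes set to zero, a positive example contributes log 2 and a negative example contributes log(1 + p), which gives a quick sanity check on any implementation of the loss.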
Next, we incorporate sparsity constraints explicitly for the prototypes w_1, ..., w_p.\n\n2.2 Incorporating Sparsity via Binary Variables\nWith some abuse of notation, we let ‖w_j‖_0 denote the number of non-zero entries of the vector w_j, and define\n\n‖W‖_{0,∞} ≜ max_{j∈[p]} ‖w_j‖_0.\n\nThe requirement that each prototype be k-sparse translates into the constraint ‖W‖_{0,∞} ≤ k. Therefore the problem of training a SMaLL model with budget k (for each prototype) can be formulated as\n\nmin_{W ∈ ℝ^{p×n}, ‖W‖_{0,∞} ≤ k}  (1/m) h̃(W) + (λ/2) ‖W‖_F^2,    (6)\n\nwhere h̃ is defined in (5). This is a very hard optimization problem due to the nonconvex sparsity constraint. In order to derive a convex relaxation, we follow the approach of [23] (cf. [36]) to introduce a binary matrix ε ∈ {0,1}^{p×n} and rewrite (6) as\n\nmin_{W ∈ ℝ^{p×n}, ε ∈ {0,1}^{p×n}, ‖ε‖_{1,∞} ≤ k}  (1/m) h̃(W ⊙ ε) + (λ/2) ‖W ⊙ ε‖_F^2,\n\nwhere ⊙ denotes the Hadamard (i.e. entry-wise) product of two matrices. Here we have\n\n‖ε‖_{1,∞} = max_{j∈[p]} ‖ε_j‖_1,\n\nwhere ε_j is the j-th row of ε. Since all entries of ε belong to {0,1}, the constraint ‖ε‖_{1,∞} ≤ k is the same as ‖ε‖_{0,∞} ≤ k. 
Noting that we can take W_ij = 0 when ε_ij = 0 and vice-versa, this problem is equivalent to\n\nmin_{W ∈ ℝ^{p×n}, ε ∈ {0,1}^{p×n}, ‖ε‖_{1,∞} ≤ k}  (1/m) h̃(W ⊙ ε) + (λ/2) ‖W‖_F^2.    (7)\n\nUsing (5), the objective function can be written as\n\n(1/m) [ Σ_{i∈I_+} ℓ((W ⊙ ε)_{j(i)} x_i) + Σ_{i∈I_−} u(−(W ⊙ ε) x_i) ] + (λ/2) ‖W‖_F^2,\n\nwhere (W ⊙ ε)_{j(i)} denotes the j(i)-th row of W ⊙ ε.\nSo far our transformations have not changed the nature of the optimization problem with sparsity constraints: it is still a hard mixed-integer nonlinear optimization problem. However, as we will show in the next section, the introduction of the binary matrix ε allows us to derive a saddle-point formulation of problem (7), which in turn admits a convex-concave relaxation that can be solved efficiently by the Mirror-Prox algorithm [37, 38].\n\n3 Saddle-Point Relaxation\n\nWe first show that the problem in (7) is equivalent to the following minimax saddle-point problem:\n\nmin_{W ∈ ℝ^{p×n}, ε ∈ {0,1}^{p×n}, ‖ε‖_{1,∞} ≤ k}  max_{S = [s_1 ··· s_m], s_i ∈ S_i, i ∈ [m]}  Φ(W, ε, S),    (8)\n\nwhere S ∈ ℝ^{p×m}, each of its columns s_i belongs to a set S_i ⊂ ℝ^p (which will be given in Proposition 2), and the function Φ is defined as\n\nΦ(W, ε, S) = (1/m) Σ_{i∈[m]} ( y_i s_i^T (W ⊙ ε) x_i − u*(s_i) ) + (λ/2) ‖W‖_F^2.\n\nIn the above definition, u* is the convex conjugate of u defined in (4):\n\nu*(s_i) = sup_{t∈ℝ^p} { s_i^T t − u(t) }\n        = Σ_{j=1}^p (−s_{i,j}) log(−s_{i,j}) + (1 + 1^T s_i) log(1 + 1^T s_i),  if s_{i,j} ≤ 0 ∀j ∈ [p] and 1^T s_i ≥ −1;\n        = ∞,  otherwise.    (9)\n\nThe equivalence between (7) and (8) is a direct consequence of the following proposition.\nProposition 2. Let ℓ(z) = log(1 + exp(−z)) where z ∈ ℝ and u(t) = log(1 + Σ_{j∈[p]} exp(−t_j)) where t ∈ ℝ^p. Then for i ∈ I_+, we have\n\nℓ((W ⊙ ε)_{j(i)} x_i) = max_{s_i ∈ S_i} ( y_i s_i^T (W ⊙ ε) x_i − u*(s_i) ),\n\nwhere y_i = +1 and S_i = {s_i ∈ ℝ^p : s_{i,j(i)} ∈ [−1, 0], s_{i,j} = 0 ∀j ≠ j(i)}. For i ∈ I_−, we have\n\nu(−(W ⊙ ε) x_i) = max_{s_i ∈ S_i} ( y_i s_i^T (W ⊙ ε) x_i − u*(s_i) ),\n\nwhere y_i = −1 and S_i = {s_i ∈ ℝ^p : 1^T s_i ≥ −1, s_{i,j} ≤ 0 ∀j ∈ [p]}.\n\nWe can further eliminate the variable W in (8). This is facilitated by the following result.\nProposition 3. For any given ε ∈ {0,1}^{p×n} and S ∈ S_1 × ··· × S_m, the solution to\n\nmin_{W ∈ ℝ^{p×n}} Φ(W, ε, S)\n\nis unique and given by\n\nW(ε, S) = −(1/(mλ)) [ Σ_{i∈[m]} y_i (s_i x_i^T) ] ⊙ ε.    (10)\n\nFigure 1: Decision surfaces of different classifier types on a run of the two-dimensional chscase funds toy dataset. Test classification accuracy is shown at the bottom right of each plot.\n\nNow we substitute W(ε, S) into (8) to obtain\n\nmin_{ε ∈ {0,1}^{p×n}, ‖ε‖_{1,∞} ≤ k}  max_{S = [s_1 ··· s_m], s_i ∈ S_i, i ∈ [m]}  φ(ε, S),    (11)\n\nwhere\n\nφ(ε, S) = −(1/(mλ)) ‖ Σ_{i∈[m]} y_i (s_i x_i^T) ⊙ ε ‖_F^2 − Σ_{i∈[m]} u*(s_i).\n\nWe note that φ(ε, S) is concave in S (which is to be maximized), but not convex in ε (which is to be minimized). However, because ε ∈ {0,1}^{p×n}, we have ε ⊙ ε = ε and thus\n\n‖ Σ_{i∈[m]} y_i (s_i x_i^T) ⊙ ε ‖_F^2 = Σ_{i∈[m]} y_i s_i^T [ Σ_{i′∈[m]} y_{i′} (s_{i′} x_{i′}^T) ⊙ ε ⊙ ε ] x_i = Σ_{i∈[m]} y_i s_i^T [ Σ_{i′∈[m]} y_{i′} (s_{i′} x_{i′}^T) ⊙ ε ] x_i.\n\nTherefore the objective function φ in (11) can be written as\n\nφ(ε, S) = −(1/(mλ)) Σ_{i∈[m]} y_i s_i^T [ Σ_{i′∈[m]} y_{i′} (s_{i′} x_{i′}^T) ⊙ ε ] x_i − Σ_{i∈[m]} u*(s_i),    (12)\n\nwhich is concave in S and linear (thus convex) in ε.\nFinally, we relax the integrality constraint on ε to its convex hull, i.e., ε ∈ [0,1]^{p×n}, and consider\n\nmin_{ε ∈ [0,1]^{p×n}, ‖ε‖_{1,∞} ≤ k}  max_{S = [s_1 ··· s_m], s_i ∈ S_i, i ∈ [m]}  φ(ε, S),    (13)\n\nwhere φ(ε, S) is given in (12). 
This is a convex-concave saddle-point problem, which can be solved efficiently, for example, by the Mirror-Prox algorithm [37, 38].\nAfter finding a solution (ε, S) of the relaxed problem (13), we can round the entries of ε to {0,1}, while respecting the constraint ‖ε‖_{1,∞} ≤ k (e.g., by rounding the largest k entries of each row to 1 and the remaining entries to 0, or by randomized rounding). Then we can recover the prototypes using (10).\n\n3.1 The Mirror-Prox Algorithm\n\nAlgorithm 1 lists the Mirror-Prox algorithm customized for solving the convex-concave saddle-point problem (13), which enjoys an O(1/t) convergence rate [37, 38].\n\nTable 1: Comparison of test accuracy on low dimensional (n < 20) OpenML datasets. In SMaLL, k was set to n for these datasets.\n\nRF\n\nGP\n\nLR\n\nDT\n\nGB\n\nAB\n\nkNN\n\nRSVM\n\nSMaLL\nLSVM\nbankruptcy .84±.07 .83±.08 .82±.05 .90±.05 .80±.05 .78±.07 .89±.06 .81±.05 .90±.05 .92±.06\nvineyard .79±.10 .72±.06 .68±.04 .82±.08 .69±.13 .70±.11 .82±.07 .68±.09 .71±.12 .83±.07\nsleuth1714 .82±.03 .82±.04 .81±.14 .83±.04 .83±.06 .82±.04 .76±.03 .82±.06 .80±.03 .83±.05\nsleuth1605 .66±.09 .70±.07 .64±.08 .70±.07 .63±.09 .66±.05 .65±.09 .65±.09 .72±.07 .72±.05\nsleuth1201 .94±.05 .94±.03 .92±.05 .93±.03 .91±.05 .90±.04 .89±.09 .88±.06 .91±.08 .94±.05\nrabe266 .93±.04 .90±.03 .91±.04 .92±.04 .91±.03 .92±.03 .93±.04 .90±.04 .95±.04 .94±.02\nrabe148 .95±.04 .93±.04 .91±.08 .95±.04 .89±.07 .92±.05 .91±.06 .91±.08 .95±.02 .96±.04\nvis_env .66±.04 .68±.05 .66±.03 .65±.08 .62±.04 .57±.03 .69±.06 .64±.03 .65±.09 .69±.03\nhutsof99 .74±.07 .66±.04 .64±.09 .73±.07 .60±.10 .66±.11 .66±.14 .67±.05 .70±.05 .75±.04\nhuman_dev .88±.03 .85±.04 .85±.03 .89±.04 .85±.03 .87±.03 .88±.03 .86±.03 .88±.02 .89±.04\nc0_100_10 .77±.04 .74±.03 .76±.03 .77±.03 .64±.07 .71±.05 .79±.03 .71±.05 .78±.01 .77±.06\nelusage .90±.05 .84±.06 .84±.06 .89±.04 .84±.06 .87±.05 .89±.04 .84±.06 .89±.04 .92±.04\ndiggle_table .65±.14 .61±.07 .57±.08 .65±.11 .60±.09 .58±.07 .57±.13 .57±.06 .60±.13 .68±.07\nbaskball .70±.02 .68±.04 .68±.02 .71±.03 .71±.03 .63±.02 .66±.05 .69±.04 .68±.02 .72±.06\nmichiganacc .72±.06 .67±.06 .71±.05 .71±.04 .67±.06 .66±.07 .71±.05 .69±.04 .71±.05 .73±.05\nelection2000 .92±.04 .90±.04 .91±.03 .92±.02 .91±.03 .92±.01 .90±.07 .92±.02 .92±.03 .94±.02\n\nAlgorithm 1 Customized Mirror-Prox algorithm for solving the saddle-point problem (13)\nInitialize ε^(0) and S^(0)\nfor t = 0, 1, ..., T do\n  Gradient step:\n    ε̂^(t) = Proj_E( ε^(t) − α_t ∇_ε φ(ε^(t), S^(t)) )\n    ŝ_i^(t) = Proj_{S_i}( s_i^(t) + β_t ∇_{s_i} φ(ε^(t), S^(t)) )  for all i ∈ [m]\n  Extra-gradient step:\n    ε^(t+1) = Proj_E( ε^(t) − α_t ∇_ε φ(ε̂^(t), Ŝ^(t)) )\n    s_i^(t+1) = Proj_{S_i}( s_i^(t) + β_t ∇_{s_i} φ(ε̂^(t), Ŝ^(t)) )  for all i ∈ [m]\nend for\nε̂ = Σ_{t=1}^T α_t ε̂^(t) / Σ_{t=1}^T α_t\nŜ = Σ_{t=1}^T β_t Ŝ^(t) / Σ_{t=1}^T β_t\nRound ε̂ to {0,1}^{p×n}\nŴ = −(1/(mλ)) Σ_{i∈[m]} y_i (ŝ_i x_i^T) ⊙ ε̂\n\nAlgorithm 2 (Proj_E) Projection onto the set E_j ≜ {ε_j ∈ ℝ^n : ε_{j,i} ∈ [0,1], ‖ε_j‖_1 ≤ k}\nInput: ε_j ∈ ℝ^n and a small tolerance tol.\nClip ε_{j,i} to [0,1] for all i ∈ [n]\nReturn ε_j if 1^T ε_j ≤ k\nBinary search to find tol-solution\nSet low = (1^T ε_j − k)/n\nSet high = max_{i∈[n]} ε_{j,i} − k/n\nwhile low ≤ high do\n  Set λ = (low + high)/2\n  Compute ε̂_j : ∀i ∈ [n], ε̂_{j,i} = ε_{j,i} − λ\n  Clip ε̂_j to [0,1]^n\n  if |1^T ε̂_j − k| < tol then\n    return ε̂_j\n  else if 1^T ε̂_j > k then\n    Set low = (low + high)/2\n  else\n    Set high = (low + high)/2\n  end if\nend while\n\nIn order to use Algorithm 1, we need to find the partial gradients of φ(ε, S), which are given as\n\n∇_ε φ(ε, S) = −(1/(mλ)) [ Σ_{i∈[m]} y_i (s_i x_i^T) ] ⊙ [ Σ_{i∈[m]} y_i (s_i x_i^T) ],\n\n∇_{s_i} φ(ε, S) = −(1/(mλ)) y_i [ Σ_{i′∈[m]} y_{i′} (s_{i′} x_{i′}^T) ⊙ ε ] x_i,    i ∈ [m].\n\nThere are two projection operators in Algorithm 1. The first one projects some ε ∈ ℝ^{p×n} onto\n\nE ≜ {ε ∈ ℝ^{p×n} : ε ∈ [0,1]^{p×n}, ‖ε‖_{1,∞} ≤ k}.\n\nThis can be done efficiently by Algorithm 2. Essentially, we perform p independent projections, each for one row of ε using a bi-section type of algorithm [39, 40, 41]. We have the following result.\n\nFigure 2: SMaLL applied to the Breast Cancer dataset with k = 3 and p = 2. The blue and orange dots represent the test instances from the two classes. 
The plots show the kernel density estimates and the actual values of the non-zero features in each prototype, as well as the final predictor result.\n\nProposition 4. Algorithm 2 computes, up to a specified tolerance tol, the projection of any ε ∈ ℝ^{p×n} onto E in O(log_2(1/tol)) time, where tol is the input precision for bisection.\nThere are two cases for the projection of s_i ∈ ℝ^p onto the set S_i. For i ∈ I_+, we only need to project s_{i,j(i)} onto the interval [−1, 0] and set s_{i,j} = 0 for all j ≠ j(i). For i ∈ I_−, the projection algorithm is similar to Algorithm 2, and we omit the details here. The step sizes α_t and β_t can be set according to the guidelines described in [37, 38], based on the smoothness properties of the function φ(ε, S). In practice, we follow the adaptive tuning procedure developed in [42].\n\n4 Experiments\n\nWe demonstrate the merits of SMaLL via an extensive set of experiments. We start with an intuition into how the class of sparse multiprototype linear predictors differs from standard model classes. Figure 1 is a visualization of the decision surface of different types of classifiers on the 2-dimensional chscase funds toy dataset, obtained from OpenML. The two classes are shown in red and blue, with training data in solid shade and test data in translucent shade. The color of each band indicates the gradation in the confidence of prediction: each classifier is more confident in the darker regions and less confident in the lighter regions. The 2-prototype linear predictor attains the best test accuracy on this toy problem (0.73). 
Note that some of the examples are highlighted by a black rectangle: the linear classifiers (logistic regression and linear SVM) could not distinguish between these examples, whereas the 2-prototype linear predictor was able to segregate and assign them to different bands.\n\nFigure 3: Comparison on high dimensional (n ≥ 50) OpenML data from the Fri series. Each stacked bar shows average test accuracy on left, and the total number of selected features on right.\n\n4.1 Low-dimensional Datasets Without Sparsity\nWe now compare the accuracy of SMaLL with k = n (no sparsity) to the accuracy of other standard classification algorithms, on several low-dimensional (n ≤ 20) binary classification datasets from the OpenML repository. We experimented with OpenML data for two main reasons: (a) it contains many preprocessed binary datasets, and (b) the datasets come from diverse domains. The methods that we compare against are linear SVM (LSVM), SVM with non-linear kernels such as radial basis function, polynomial, and sigmoid (RSVM), Logistic Regression (LR), Decision Trees (DT), Random Forest (RF), k-Nearest Neighbor (kNN), Gaussian Process (GP), Gradient Boosting (GB), and AdaBoost (AB). All the datasets were normalized to make each feature have zero mean and unit variance. Since the datasets do not specify separate train, validation, and test sets, we measure test accuracy by averaging over five random train-test splits. Since we are interested in extreme sparsity, we pre-clustered the positive examples into p = 2 clusters, and initialized the prototypes with the cluster centers. We determined hyperparameters by 5-fold cross-validation. The coefficient of the error term C in LSVM and ℓ2-regularized LR was selected from {0.1, 1, 10, 100}. In the case of RSVM, we also added 0.01 to the search set for C, and chose the best kernel between a radial basis function (RBF), polynomials of degree 2 and 3, and sigmoid. 
For the ensemble methods (RF, AB, GB), the number of base predictors was selected from the set {10, 20, 50}. The maximum number of features for RF estimators was optimized over the square root and the log selection criteria. We also found best validation parameters for DT (gini or entropy for attribute selection), kNN (1, 3, 5 or 7 neighbors), and GP (RBF kernel scaled by a coefficient in the set {0.1, 1.0, 5} and dot product kernel with inhomogeneity parameter σ set to 1). Finally, for our method SMaLL, we fixed λ = 0.1 and α_t = 0.01, and searched over β_t = β ∈ {0.01, 0.001}.\nTable 1 shows the test accuracy for the different algorithms on different datasets. As seen from the table, SMaLL with k = n generally performed extremely well on most of these datasets. This substantiates the practicality of SMaLL in the low dimensional regime.\n\n4.2 Higher-dimensional Datasets with Sparsity\n\nWe now describe results with higher dimensional data, where feature selection becomes especially critical. To substantiate our claim that SMaLL produces an interpretable model, we ran SMaLL on the Breast Cancer dataset with k = 3 and p = 2 (two prototypes, three non-zero elements in each).\nFigure 2 shows the kernel density estimates and the actual values of the selected features in each prototype, and the summary of our predictor. Note that the feature perimeter_worst appears in both prototypes. As the rightmost plot shows, the predictor output provides a good separation of the test data, and SMaLL registered a test accuracy of over 94%. It is straightforward to understand how the resulting classifier reaches its decisions: which features it relies on and how those features interact.\nNext, we compare SMaLL with 8 other methods. Six of these methods induce sparsity by minimizing an ℓ1-regularized loss function. 
These methods minimize one of the three empirical loss functions (hinge loss, log loss, and the binary-classification Huber loss), regularized by either an ℓ1 or an elastic net penalty (i.e. ℓ1 and ℓ2). We refer to these as L1Hi (ℓ1, hinge), L1L (ℓ1, log), L1Hu (ℓ1, Huber), EnHi (elastic net, hinge), ENL (elastic net, log) and ENHu (elastic net, Huber). We also compare with two state-of-the-art methods for the scarce-resource setting. ProtoNN [1] is a modern take on nearest neighbor classifiers, while Bonsai [2] is a sophisticated enhancement of a small decision tree.\n\nFigure 4: The big picture. The plot depicts the performance of SMaLL compared to both the standard classification algorithms and the sparse baselines on the fri_c0_1000_50 dataset. The number atop each bar is the average number of features selected by that algorithm across 5 runs.\n\nNote that while we can explicitly control the amount of sparsity in SMaLL, ProtoNN, and Bonsai, the methods that use ℓ1 or elastic net regularization do not have this flexibility. Therefore, in order to get the different baselines on the same footing, we devised the following empirical methodology. We specified p * k = 6 features as the desired sparsity, and modulated each linear baseline to yield nearly these many features. We trained each of the linear baselines by setting a high value of the ℓ1 coefficient and selected the features with the largest absolute values. Then, we retrained the classifier using only the selected features, using the same loss (hinge, Huber, or log) and an ℓ2 regularization. Our procedure ensured that each baseline benefited, in effect, from an elastic net-like regularization while having the most important features at its disposal. For the SMaLL classifier, we fixed k = 3 and p = 2. 
In practice, this setting will be application specific (e.g., it would likely depend on the budget).\nAs before, since the original datasets did not specify a train-test split, our results were averaged over five random splits. The parameters for each method were tuned using 5-fold cross-validation. We fixed λ = 0.1 and performed a joint search over α_t ∈ {0.1, 1e-2, 1e-3} and β_t ∈ {1e-3, 1e-4}. For all the baselines, we optimized the cross validation error over the ℓ1 regularization coefficients in the set {1e-1, 1e-2, 1e-3, 1e-4}. Moreover, in case of elastic net, the ratio of the ℓ1 coefficient to the ℓ2 coefficient was set to 1. The depth of the estimators in Bonsai was selected from {2, 3, 4}. Finally, the dimensionality of projection in ProtoNN was searched over {5, 10, 15, 20}.\nFigure 3 provides strong empirical evidence that SMaLL compares favorably to the baselines on several high dimensional OpenML datasets belonging to the Fri series. Specifically, the first number in each dataset name indicates the number of examples, and the second the dimensionality of the dataset. Note that in case of SMaLL, some features might be selected in more than one prototype. Therefore, to be fair to the other methods, we included the multiplicity while computing the total feature count. We observe that, on all but one of these datasets, SMaLL outperformed the ProtoNN and Bonsai models at the same level of sparsity, and the gap between SMaLL and these methods generally turned out to be huge. Moreover, compared to the linear baselines, SMaLL achieved consistently better performance at much sparser levels. This shows the promise of SMaLL toward achieving succinct yet accurate predictors in the high dimensional regime. The merits of SMaLL are further reinforced in Fig. 4 that shows the accuracy-sparsity trade-offs. 
We observe that with just 6\nfeatures, SMaLL provides better test accuracy than all the baselines except GB and AB. This\nshows the potential of SMaLL as a practical algorithm for resource-deficient environments.\n\nReferences\n[1] C. Gupta, A. S. Suggala, A. Goyal, H. V. Simhadri, B. Paranjape, A. Kumar, S. Goyal, R. Udupa,\nM. Varma, and P. Jain. ProtoNN: compressed and accurate kNN for resource-scarce devices. In\nICML, pages 1331\u20131340, 2017.\n\n[2] A. Kumar, S. Goyal, and M. Varma. Resource-efficient machine learning in 2 kb ram for the\ninternet of things. In ICML, pages 1935\u20131944, 2017.\n\n[3] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134\u20131142,\n1984.\n\n[4] J. R. Hauser, O. Toubia, T. Evgeniou, R. Befurt, and D. Dzyabura. Disjunctions of conjunctions,\ncognitive simplicity, and consideration sets. Journal of Marketing Research, 47(3):485\u2013496,\n2010.\n\n[5] T. Wang, C. Rudin, F. Doshi, Y. Liu, E. Klampfl, and P. MacNeille. Bayesian rule sets for\ninterpretable classification, with application to context-aware recommender systems. JMLR,\n18(70):1\u201337, 2017.\n\n[6] O. Cord. Genetic fuzzy systems: evolutionary tuning and learning of fuzzy knowledge bases,\nvolume 19. World Scientific, 2001.\n\n[7] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning\ndnf and characterizing statistical query learning using fourier analysis. In Proceedings of the\nTwenty-sixth Annual ACM Symposium on Theory of Computing (STOC), pages 253\u2013262, 1994.\n\n[8] Y. Mansour. An O(n^{log log n}) learning algorithm for dnf under the uniform distribution. Journal\nof Computer and System Sciences, 50(3):543\u2013550, 1995.\n\n[9] J. C. Jackson. An efficient membership-query algorithm for learning dnf with respect to the\nuniform distribution. Journal of Computer and System Sciences, 55(3):414\u2013440, 1997.\n\n[10] K. Verbeurgt. 
Learning sub-classes of monotone dnf on the uniform distribution. In Proceedings\nof the Ninth Conference on Algorithmic Learning Theory, pages 385\u2013399, 1998.\n\n[11] N. H. Bshouty, J. C. Jackson, and C. Tamon. More efficient pac-learning of dnf with membership\nqueries under the uniform distribution. In Computational Learning Theory (COLT), pages\n286\u2013295, 1999.\n\n[12] Y. Sakai and A. Maruoka. Learning monotone log-term dnf formulas under the uniform\ndistribution. Theory of Computing Systems, 33(1):17\u201333, 2000.\n\n[13] R. A. Servedio. On learning monotone dnf under product distributions. Information and\nComputation, 193(1):57\u201374, 2004.\n\n[14] N. H. Bshouty, E. Mossel, R. O\u2019Donnell, and R. A. Servedio. Learning dnf from random walks.\nJournal of Computer and System Sciences, 71(3):250\u2013265, 2005.\n\n[15] V. Feldman. Learning DNF expressions from fourier spectrum. In Conference on Learning\nTheory (COLT), pages 17.1\u201317.19, 2012.\n\n[16] A. R. Klivans and R. A. Servedio. Learning dnf in time 2^{\u00d5(n^{1/3})}. Journal of Computer and\nSystem Sciences, 68(2):303\u2013318, 2004.\n\n[17] S. Khot and R. Saket. Hardness of minimizing and learning dnf expressions. In Foundations of\nComputer Science (FOCS), pages 231\u2013240, 2008.\n\n[18] F. Aiolli and A. Sperduti. Multiclass classification with multi-prototype support vector machines.\nJMLR, 6:817\u2013850, 2005.\n\n[19] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a\nbudget. SIAM Journal on Computing, 37(5):1342\u20131372, 2008.\n\n[20] O. Dekel and Y. Singer. Support vector machines on a budget. In NIPS, pages 345\u2013352, 2007.\n\n[21] M. Kusner, S. Tyree, K. Q. Weinberger, and K. Agrawal. Stochastic neighbor compression. In\nICML, pages 622\u2013630, 2014.\n\n[22] K. Zhong, R. Guo, S. Kumar, B. Yan, D. Simcha, and I. Dhillon. Fast Classification with Binary\nPrototypes. 
In AISTATS, pages 1255\u20131263, 2017.\n\n[23] M. Pilanci and M. J. Wainwright. Sparse learning via Boolean relaxations. Mathematical\n\nProgramming, 151:63\u201387, 2015.\n\n[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical\n\nSociety. Series B (methodological), 58(1):267\u2013288, 1996.\n\n[25] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the\n\nRoyal Statistical Society. Series B (methodological), 67(2):301\u2013320, 2005.\n\n[26] V. K. Garg, L. Xiao, and O. Dekel. Sparse Multiprototype Classi\ufb01cation. In UAI, 2018.\n[27] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with\n\npruning, trained quantization and huffman coding. In ICLR, 2016.\n\n[28] F. Nan, J. Wang, and V. Saligrama. Pruning random forests for prediction on a budget. In NIPS,\n\npages 2334\u20132342, 2016.\n\n[29] J.-H. Luo, J. Wu, and W. Lin. Thinet: A \ufb01lter level pruning method for deep neural network\n\ncompression. In ICCV, pages 5068\u20135076, 2017.\n\n[30] T. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix\nfactorization for deep neural network training with high-dimensional output targets. In ICASSP,\npages 6655\u20136659, 2013.\n\n[31] P. Nakkiran, R. Alvarez, R. Prabhavalkar, and C. Parada. Compressing deep neural networks\nusing a rank-constrained topology. In Sixteenth Annual Conference of the International Speech\nCommunication Association, 2015.\n\n[32] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with\n\nthe hashing trick. In ICML, pages 2285\u20132294, 2015.\n\n[33] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks.\n\nIn NIPS, pages 4107\u20134115, 2016.\n\n[34] D. Bertsimas and R. Shioda. Classi\ufb01cation and regression via integer optimization. Operations\n\nResearch, 55(2):252\u2013271, 2007.\n\n[35] S. Boyd and L. 
Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[36] M. Tan, I. W. Tsang, and Li Wang. Towards ultrahigh dimensional feature selection for big data.\nJMLR, 15(1):1371\u20131429, 2014.\n\n[37] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with\nLipschitz continuous monotone operators and smooth convex-concave saddle point problems.\nSIAM Journal on Optimization, 15(1):229\u2013251, 2004.\n\n[38] A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization,\nII: Utilizing problem\u2019s structure. In S. Sra, S. Nowozin, and S. J. Wright, editors,\nOptimization for Machine Learning, chapter 6, pages 149\u2013184. The MIT Press, Cambridge,\nMA, 2011.\n\n[39] P. Brucker. An O(n) algorithm for quadratic Knapsack problems. Operations Research Letters,\n3(3):163\u2013166, 1984.\n\n[40] P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs\nsubject to upper and lower bounds. Mathematical Programming, 46:321\u2013328, 1990.\n\n[41] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projection onto the \u2113_1-ball for\nlearning in high dimensions. In ICML, pages 272\u2013279, 2008.\n\n[42] A. Jalali, M. Fazel, and L. Xiao. Variational Gram functions: Convex analysis and optimization.\nSIAM Journal on Optimization, 27(4):2634\u20132661, 2017.\n", "award": [], "sourceid": 5484, "authors": [{"given_name": "Vikas", "family_name": "Garg", "institution": "MIT"}, {"given_name": "Ofer", "family_name": "Dekel", "institution": "Microsoft Research"}, {"given_name": "Lin", "family_name": "Xiao", "institution": "Microsoft Research"}]}