{"title": "Multi-output Polynomial Networks and Factorization Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 3349, "page_last": 3359, "abstract": "Factorization machines and polynomial networks are supervised polynomial models based on an efficient low-rank decomposition. We extend these models to the multi-output setting, i.e., for learning vector-valued functions, with application to multi-class or multi-task problems. We cast this as the problem of learning a 3-way tensor whose slices share a common basis and propose a convex formulation of that problem. We then develop an efficient conditional gradient algorithm and prove its global convergence, despite the fact that it involves a non-convex basis selection step. On classification tasks, we show that our algorithm achieves excellent accuracy with much sparser models than existing methods. On recommendation system tasks, we show how to combine our algorithm with a reduction from ordinal regression to multi-output classification and show that the resulting algorithm outperforms simple baselines in terms of ranking accuracy.", "full_text": "Multi-output Polynomial Networks\n\nand Factorization Machines\n\nMathieu Blondel\n\nNTT Communication Science Laboratories\n\nKyoto, Japan\n\nmathieu@mblondel.org\n\nVlad Niculae\u2217\nCornell University\n\nIthaca, NY\n\nvlad@cs.cornell.edu\n\nTakuma Otsuka\n\nNaonori Ueda\n\nNTT Communication Science Laboratories\n\nNTT Communication Science Laboratories\n\nKyoto, Japan\n\notsuka.takuma@lab.ntt.co.jp\n\nRIKEN\n\nKyoto, Japan\n\nueda.naonori@lab.ntt.co.jp\n\nAbstract\n\nFactorization machines and polynomial networks are supervised polynomial mod-\nels based on an ef\ufb01cient low-rank decomposition. We extend these models to the\nmulti-output setting, i.e., for learning vector-valued functions, with application to\nmulti-class or multi-task problems. 
We cast this as the problem of learning a 3-way\ntensor whose slices share a common basis and propose a convex formulation of that\nproblem. We then develop an ef\ufb01cient conditional gradient algorithm and prove\nits global convergence, despite the fact that it involves a non-convex basis selec-\ntion step. On classi\ufb01cation tasks, we show that our algorithm achieves excellent\naccuracy with much sparser models than existing methods. On recommendation\nsystem tasks, we show how to combine our algorithm with a reduction from ordinal\nregression to multi-output classi\ufb01cation and show that the resulting algorithm\noutperforms simple baselines in terms of ranking accuracy.\n\n1\n\nIntroduction\n\nInteractions between features play an important role in many classi\ufb01cation and regression tasks.\nClassically, such interactions have been leveraged either explicitly, by mapping features to their\nproducts (as in polynomial regression), or implicitly, through the use of the kernel trick. While fast\nlinear model solvers have been engineered for the explicit approach [9, 28], they are typically limited\nto small numbers of features or low-order feature interactions, due to the fact that the number of\nparameters that they need to learn scales as O(dt), where d is the number of features and t is the order\nof interactions considered. Models kernelized with the polynomial kernel do not suffer from this\nproblem; however, the cost of storing and evaluating these models grows linearly with the number of\ntraining instances, a problem sometimes referred to as the curse of kernelization [30].\n\nFactorization machines (FMs) [25] are a more recent approach that can use pairwise feature interac-\ntions ef\ufb01ciently even in very high-dimensional data. The key idea of FMs is to model the weights\nof feature interactions using a low-rank matrix. 
Not only does this idea offer clear benefits in terms of model compression compared to the aforementioned approaches, it has also proved instrumental in modeling interactions between categorical variables, converted to binary features via a one-hot encoding. Such binary features are usually so sparse that many interactions are never observed in the training set, preventing classical approaches from capturing their relative importance.\n\n\u2217Work performed during an internship at NTT Communication Science Laboratories, Kyoto.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fBy imposing a low rank on the feature interaction weight matrix, FMs encourage shared parameters between interactions, making it possible to estimate their weights even if they never occurred in the training set. This property has been used in recommender systems to model interactions between user variables and item variables, and is the basis of several industrial successes of FMs [32, 17].\n\nOriginally motivated as neural networks with a polynomial activation (instead of the classical sigmoidal or rectifier activations), polynomial networks (PNs) [20] have been shown to be intimately related to FMs and to only subtly differ in the non-linearity they use [5]. PNs achieve better performance than rectifier networks on pedestrian detection [20] and on dependency parsing [10], and outperform kernel approximations such as the Nystr\u00f6m method [5]. However, existing PN and FM works have been limited to single-output models, i.e., they are designed to learn scalar-valued functions, which restricts them to regression or binary classification problems.\n\nOur contributions. 
In this paper, we generalize FMs and PNs to multi-output models, i.e., for\nlearning vector-valued functions, with application to multi-class or multi-task problems.\n\n1) We cast learning multi-output FMs and PNs as learning a 3-way tensor, whose slices share a\ncommon basis (each slice corresponds to one output). To obtain a convex formulation of that\nproblem, we propose to cast it as learning an in\ufb01nite-dimensional but row-wise sparse matrix. This\ncan be achieved by using group-sparsity inducing penalties. (\u00a73)\n\n2) To solve the obtained optimization problem, we develop a variant of the conditional gradient\n(a.k.a. Frank-Wolfe) algorithm [11, 15], which repeats the following two steps: i) select a new basis\nvector to add to the model and ii) re\ufb01t the model over the current basis vectors. (\u00a74) We prove the\nglobal convergence of this algorithm (Theorem 1), despite the fact that the basis selection step is\nnon-convex and more challenging in the shared basis setting. (\u00a75)\n\n3) On multi-class classi\ufb01cation tasks, we show that our algorithm achieves comparable accuracy to\nkernel SVMs but with much more compressed models than the Nystr\u00f6m method. On recommender\nsystem tasks, where kernelized models cannot be used (since they do not generalize to unseen\nuser-item pairs), we demonstrate how our algorithm can be combined with a reduction from ordinal\nregression to multi-output classi\ufb01cation and show that the resulting algorithm outperforms single-\noutput PNs and FMs both in terms of root mean squared error (RMSE) and ranking accuracy, as\nmeasured by nDCG (normalized discounted cumulative gain) scores. (\u00a76)\n\n2 Background and related work\n\nNotation. We denote the set {1, . . . , m} by [m]. Given a vector v \u2208 Rk, we denote its elements\nby vr \u2208 R \u2200r \u2208 [k]. Given a matrix V \u2208 Rk\u00d7m, we denote its rows by vr \u2208 Rm \u2200r \u2208 [k] and its\ncolumns by v:,c \u2200c \u2208 [m]. 
We denote the lp norm of V by kV kp := k vec(V )kp and its lp/lq norm by kV kp,q := (\u03a3_{r=1}^k kvrk_q^p)^{1/p}. The number of non-zero rows of V is denoted by kV k0,\u221e.\n\nFactorization machines (FMs). Given an input vector x \u2208 Rd, FMs predict a scalar output by \u02c6yFM := wTx +Xi 0 is a hyper-parameter. In this paper, we focus on the l1 (lasso), l1/l2 (group lasso) and l1/l\u221e penalties for \u2126, cf. Table 1. However, as we shall see, solving (2) is more challenging with the l1/l2 and l1/l\u221e penalties than with the l1 penalty. Although our formulation is based on an infinite view, we next show that U \u22c6 has finite row support.\n\nProposition 1 Finite row support of U \u22c6 for multi-output PNs and FMs\nLet U \u22c6 be an optimal solution of (2), where \u2126 is one of the penalties in Table 1. Then, kU \u22c6k0,\u221e \u2264 nm + 1. If \u2126(\u00b7) = k \u00b7 k1, we can tighten this bound to kU \u22c6k0,\u221e \u2264 min(nm + 1, dm).\nProof is in Appendix B.1. It is open whether we can tighten this result when \u2126 = k \u00b7 k1,2 or k \u00b7 k1,\u221e.\n\n4 A conditional gradient algorithm with approximate basis vector selection\n\nAt first glance, learning with an infinite number of basis vectors seems impossible. In this section, we show how the well-known conditional gradient algorithm [11, 15] combined with group-sparsity inducing penalties naturally leads to a greedy algorithm that selects and adds basis vectors that are useful across all outputs. 
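As a concrete reference point, the multi-output model that this section optimizes predicts, for each input, a sum of activated basis projections weighted by output-layer rows (cf. the output computation in Algorithm 1). A minimal numpy sketch with quadratic activation; the function and variable names are our own choices, not the paper's code:

```python
import numpy as np

def predict_multi_output_pn(X, H, V):
    """Multi-output polynomial network prediction with quadratic activation
    sigma(z) = z**2, i.e. o(x) = sum_r sigma(h_r^T x) v_r.

    X: (n, d) inputs, H: (k, d) basis vectors, V: (k, m) output layer.
    Returns an (n, m) matrix of scores, one row per input.
    (For FMs, sigma would instead be the degree-2 ANOVA kernel.)
    """
    Z = X @ H.T          # (n, k): inner products h_r^T x_i
    return (Z ** 2) @ V  # quadratic activation, then output layer
```

With a single basis vector, this reduces to a rank-one quadratic model per output, which is the building block the greedy algorithm adds one unit at a time.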
On every iteration, the conditional gradient algorithm performs updates\n\nof the form U (t+1) = (1 \u2212 \u03b3)U (t) + \u03b3\u2206\u22c6, where \u03b3 \u2208 [0, 1] is a step size and \u2206\u22c6 is obtained by\nsolving a linear approximation of the objective around the current iterate U (t):\n\n\u2206\u22c6 \u2208 argmin\n\n\u2126(\u2206)\u2264\u03c4h\u2206,\u2207F (U (t))i = \u03c4 \u00b7 argmax\n\n\u2126(\u2206)\u22641h\u2206,\u2212\u2207F (U (t))i.\n\n(3)\n\nLet us denote the negative gradient \u2212\u2207F (U ) by G \u2208 R|H|\u00d7m for short. Its elements are de\ufb01ned by\n\nn\n\ngh,c = \u2212\n\n\u03c3(hTxi)\u2207\u2113 (yi, o(xi; U ))c ,\n\nXi=1\n\nwhere \u2207\u2113(y, o) \u2208 Rm is the gradient of \u2113 w.r.t. o (cf. Appendix A). For ReLu activations, solving\n(3) is known to be NP-hard [1]. Here, we focus on quadratic activations, for which we will be able to\nprovide approximation guarantees. Plugging the expression of \u03c3, we get\n\nn\n\n1\n\n2(cid:16)X TDcX \u2212 Dc\n\ndiag(xi)2(cid:17) (FM)\ngh,c = \u2212hT\u0393ch where \u0393c := X TDcX (PN) or \u0393c :=\nand Dc \u2208 Rn\u00d7n is a diagonal matrix such that (Dc)i,i := \u2207\u2113(yi, o(xi; U ))c. Let us recall the\nde\ufb01nition of the dual norm of \u2126: \u2126\u2217(G) := max\u2126(\u2206)\u22641h\u2206, Gi. By comparing this equation to (3),\nwe see that \u2206\u22c6 is the argument that achieves the maximum in the dual norm \u2126\u2217(G), up to a constant\nfactor \u03c4 . It is easy to verify that any element in the subdifferential of \u2126\u2217(G), which we denote by\n\u2202\u2126\u2217(G) \u2286 R|H|\u00d7m, achieves that maximum, i.e., \u2206\u22c6 \u2208 \u03c4 \u00b7 \u2202\u2126\u2217(G).\nBasis selection. 
As shown in Table 1, elements of \u2202\u2126\u2217(G) (subgradients) are |H|\u00d7 m matrices with\na single non-zero row indexed by h\u22c6, where h\u22c6 is an optimal basis (hidden unit) selected by\n\nXi=1\n\nh\u22c6 \u2208 argmax\n\nh\u2208H kghkp,\n\n4\n\n(4)\n\n\fand where p = \u221e when \u2126 = k \u00b7 k1, p = 2 when \u2126 = k.k1,2 and p = 1 when \u2126 = k \u00b7 k1,\u221e. We\ncall (4) a basis vector selection criterion. Although this selection criterion was derived from the\nlinearization of the objective, it is fairly natural: it chooses the basis vector with largest \u201cviolation\u201d,\nas measured by the lp norm of the negative gradient row gh.\n\nMultiplicative approximations. The key challenge in solving (3) or equivalently (4) arises from the\nfact that G has in\ufb01nitely many rows gh. We therefore cast basis vector selection as a continuous\noptimization problem w.r.t. h. Surprisingly, although the entire objective (2) is convex, (4) is not.\n\nInstead of the exact maximum, we will therefore only require to \ufb01nd a \u02c6\u2206 \u2208 R|H|\u00d7m that satis\ufb01es\n\n\u2126( \u02c6\u2206) \u2264 \u03c4\n\nand\n\nh \u02c6\u2206, Gi \u2265 \u03bdh\u2206\u22c6, Gi,\n\nwhere \u03bd \u2208 (0, 1] is a multiplicative approximation (higher is better). It is easy to verify that this is\nequivalent to replacing the optimal h\u22c6 by an approximate \u02c6h \u2208 H that satis\ufb01es kg\u02c6hkp \u2265 \u03bdkgh\u22c6kp.\nSparse case. When \u2126(\u00b7) = k \u00b7 k1, we need to solve\n\nmax\n\nh\u2208H kghk\u221e = max\nh\u2208H\n\nmax\n\nc\u2208[m]|hT\u0393ch| = max\nc\u2208[m]\n\nmax\n\nh\u2208H |hT\u0393ch|.\n\nIt is well known that the optimal solution of maxh\u2208H |hT\u0393ch| is the dominant eigenvector of \u0393c.\nTherefore, we simply need to \ufb01nd the dominant eigenvector hc of each \u0393c and select \u02c6h as the hc\nwith largest singular value |hT\n\n\u0393chc|. 
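The sparse-case selection just described (find the dominant eigenvector of each Gamma_c, then keep the one with largest |h^T Gamma_c h|) can be sketched as follows. This is our own illustration: the names are invented, and we iterate on Gamma^2 so that the eigenvalue of largest magnitude wins regardless of its sign.

```python
import numpy as np

def select_basis_l1(Gammas, n_iter=200, seed=0):
    """Basis selection for the l1 penalty: for each output c, approximate the
    dominant eigenvector h_c of the symmetric matrix Gamma_c by power
    iteration (run on Gamma_c @ Gamma_c, so the largest-magnitude eigenvalue
    dominates), then return the h_c maximizing |h_c^T Gamma_c h_c|."""
    rng = np.random.default_rng(seed)
    best_h, best_val = None, -np.inf
    for Gamma in Gammas:              # one symmetric d x d matrix per output
        h = rng.standard_normal(Gamma.shape[0])
        h /= np.linalg.norm(h)
        for _ in range(n_iter):
            h = Gamma @ (Gamma @ h)   # power step on Gamma^2
            h /= np.linalg.norm(h)
        val = abs(h @ Gamma @ h)
        if val > best_val:
            best_h, best_val = h, val
    return best_h, best_val
```

This costs one power iteration per output, matching the per-class eigenvector computation described above.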
Using the power method, we can \ufb01nd an hc that satis\ufb01es\n|hT\n\n\u0393chc| \u2265 (1 \u2212 \u01eb) max\n\n(5)\n\nc\n\nc\n\nh\u2208H |hT\u0393ch|,\n\nfor some tolerance parameter \u01eb \u2208 (0, 1). The procedure takes O(Nc log(d)/\u01eb) time, where Nc is\nthe number of non-zero elements in \u0393c [26]. Taking the maximum w.r.t. c \u2208 [m] on both sides of\n(5) leads to kg\u02c6hk\u221e \u2265 \u03bdkgh\u22c6k\u221e, where \u03bd = 1 \u2212 \u01eb. However, using \u2126 = k \u00b7 k1 does not encourage\nselecting an \u02c6h that is useful for all outputs. In fact, when \u2126 = k \u00b7 k1, our approach is equivalent to\nimposing independent nuclear norms on W1, . . . , Wm.\nGroup-sparse cases. When \u2126(\u00b7) = k.k1,2 or \u2126(\u00b7) = k.k1,\u221e, we need to solve\nh\u2208H kghk2\n\nh\u2208H kghk1 = max\nh\u2208H\n\n2 = max\nh\u2208H\n\n|hT\u0393ch|,\n\n(hT\u0393ch)2\n\nf2(h) :=\n\nf1(h) :=\n\nor max\n\nmax\n\nrespectively. Unlike the l1-constrained case, we are clearly selecting a basis vector with largest viola-\ntion across all outputs. However, we are now faced with a more dif\ufb01cult non-convex optimization\nproblem. Our strategy is to \ufb01rst choose an initialization h(0) which guarantees a certain multiplicative\napproximation \u03bd, then re\ufb01ne the solution using a monotonically non-increasing iterative procedure.\nInitialization. We simply choose h(0) as the approximate solution of the \u2126 = k\u00b7k1 case, i.e., we have\n\nNow, using \u221amkxk\u221e \u2265 kxk2 \u2265 kxk\u221e and mkxk\u221e \u2265 kxk1 \u2265 kxk\u221e, this immediately implies\n\nkgh(0)k\u221e \u2265 (1 \u2212 \u01eb) max\n\nh\u2208H kghk\u221e.\n\nkgh(0)kp \u2265 \u03bd max\n\nh\u2208H kghkp,\n\nwith \u03bd = 1\u2212\u01eb\u221am if p = 2 and \u03bd = 1\u2212\u01eb\nRe\ufb01ning the solution. 
We now apply another instance of the conditional gradient algorithm to solve\nthe subproblem maxkhk2\u22641 fp(h) itself, leading to the following iterates:\n\nm if p = 1.\n\nh(t+1) = (1 \u2212 \u03b7t)h(t) + \u03b7t \u2207fp(h(t))\nk\u2207fp(h(t))k2\n\n,\n\n(6)\n\nwhere \u03b7t \u2208 [0, 1]. Following [3, Section 2.2.2], if we use the Armijo rule to select \u03b7t, every limit\npoint of the sequence {h(t)} is a stationary point of fp. In practice, we observe that \u03b7t = 1 is almost\nalways selected. Note that when \u03b7t = 1 and m = 1 (i.e., single-output case), our re\ufb01ning algorithm\nrecovers the power method. Generalized power methods were also studied for structured matrix\nfactorization [16, 21], but with different objectives and constraints. Since the conditional gradient\n\n5\n\nm\n\nXc=1\n\nm\n\nXc=1\n\n\fAlgorithm 1 Multi-output PN/FM training\n\nInput: X, y, k, \u03c4\nH \u2190 [ ], V \u2190 [ ]\nfor t := 1, . . . , k do\n\nr xi)vr \u2200i \u2208 [n]\n\nr=1 \u03c3(hT\n\nCompute oi := Pt\u22121\nLet gh := [\u2212hT\u03931h, . . . , \u2212hT\u0393mh]T\nFind \u02c6h \u2248 argmaxh\u2208H kghkp\nAppend \u02c6h to H and 0 to V\nV \u2190 argmin\n\u2126(V )\u2264\u03c4\n\nFt(V , H)\n\nOptional: V , H \u2190 argmin\n\u2126(V )\u2264\u03c4\n\nFt(V , H)\n\nhr \u2208H \u2200r\u2208[t]\n\nend for\nOutput: V , H (equivalent to U = Pk\n\nt=1 eht vT\nt )\n\n2 otherwise.\n\n2 x2 if |x| \u2264 1, |x| \u2212 1\n\nalgorithm assumes a differentiable function, in the case p = 1, we replace the absolute function with\nthe Huber function |x| \u2248 1\nCorrective re\ufb01tting step. After t iterations, U (t) contains at most t non-zero rows. We can therefore\nalways store U (t) as V (t) \u2208 Rt\u00d7m (the output layer matrix) and H (t) \u2208 Rt\u00d7d (the basis vectors /\nhidden units added so far). In order to improve accuracy, on iteration t, we can then re\ufb01t the objective\nr xi)vr(cid:17). 
We consider two kinds of corrective steps, a convex\nFt(V , H) :=Pn\none that minimizes Ft(V , H (t)) w.r.t. V \u2208 Rt\u00d7m and an optional non-convex one that minimizes\nFt(V , H) w.r.t. both V \u2208 Rt\u00d7m and H \u2208 Rt\u00d7d. Re\ufb01tting allows to remove previously-added\n\nbad basis vectors, thanks to the use of sparsity-inducing penalties. Similar re\ufb01tting procedures are\ncommonly used in matching pursuit [22]. The entire procedure is summarized in Algorithm 1 and\nimplementation details are given in Appendix D.\n\ni=1 \u2113(cid:16)yi,Pt\n\nr=1 \u03c3(hT\n\n5 Analysis of Algorithm 1\n\nThe main dif\ufb01culty in analyzing the convergence of Algorithm 1 stems from the fact that we cannot\nsolve the basis vector selection subproblem globally when \u2126 = k \u00b7 k1,2 or k \u00b7 k1,\u221e. Therefore, we\nneed to develop an analysis that can cope with the multiplicative approximation \u03bd. Multiplicative\napproximations were also considered in [18] but the condition they require is too stringent (cf.\nAppendix B.2 for a detailed discussion). The next theorem guarantees the number of iterations needed\nto output a multi-output network that achieves as small objective value as an optimal solution of (2).\n\nTheorem 1 Convergence of Algorithm 1\nAssume F is smooth with constant \u03b2. Let U (t) be the output after t iterations of Algorithm 1 run with\n\nconstraint parameter \u03c4\n\n\u03bd . Then, F (U (t)) \u2212 min\n\u2126(U )\u2264\u03c4\n\nF (U ) \u2264 \u01eb \u2200t \u2265\n\n8\u03c4 2\u03b2\n\u01eb\u03bd2 \u2212 2.\n\nIn [20], single-output PNs were trained using GECO [26], a greedy algorithm with similar O(cid:0) \u03c4 2\u03b2\n\u01eb\u03bd2(cid:1)\n\nguarantees. However, GECO is limited to learning in\ufb01nite vectors (not matrices) and it does not\nconstrain its iterates like we do. Hence GECO cannot remove bad basis vectors. The proof of\nTheorem 1 and a detailed comparison with GECO are given in Appendix B.2. 
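To make the structure of Algorithm 1 concrete, here is a deliberately simplified end-to-end sketch for the squared loss. It is not the authors' method: it uses only the l1-style selection criterion, and it replaces the Omega-constrained corrective step with an unconstrained least-squares refit of the output layer, so it illustrates the greedy select-then-refit loop rather than the full constrained procedure.

```python
import numpy as np

def train_multi_output_pn(X, Y, k, n_power=200, seed=0):
    """Greedy training of a multi-output PN (quadratic activation), loosely
    following Algorithm 1 with the squared loss, so grad l(y, o) = o - y.
    Simplifications (ours): l1-style basis selection; V refit by plain
    least squares instead of the Omega-constrained problem.
    X: (n, d), Y: (n, m). Returns H: (k, d) and V: (k, m)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = Y.shape[1]
    H = np.empty((0, d))
    V = np.empty((0, m))
    for t in range(k):
        O = (X @ H.T) ** 2 @ V        # current outputs (zeros when H empty)
        R = O - Y                      # per-sample loss gradients
        # Basis selection: dominant eigenvector over Gamma_c = X^T D_c X.
        best_h, best_val = None, -np.inf
        for c in range(m):
            Gamma = X.T @ (R[:, c:c + 1] * X)
            h = rng.standard_normal(d)
            h /= np.linalg.norm(h)
            for _ in range(n_power):
                h = Gamma @ (Gamma @ h)   # power step on Gamma^2
                h /= np.linalg.norm(h)
            val = abs(h @ Gamma @ h)
            if val > best_val:
                best_h, best_val = h, val
        H = np.vstack([H, best_h])
        # Corrective step: refit output layer on features sigma(h_r^T x).
        Phi = (X @ H.T) ** 2
        V, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return H, V
```

Because the whole output layer is refit each iteration, a previously added basis vector can effectively be zeroed out, which is the role the sparsity-inducing refit plays in the real algorithm.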
Finally, we note\nthat the in\ufb01nite dimensional view is also key to convex neural networks [2, 1]. However, to our\nknowledge, we are the \ufb01rst to give an explicit multiplicative approximation guarantee for a non-linear\nmulti-output network.\n\n6 Experimental results\n\n6.1 Experimental setup\n\nDatasets. For our multi-class experiments, we use four publicly-available datasets: segment (7\nclasses), vowel (11 classes), satimage (6 classes) and letter (26 classes) [12]. Quadratic models sub-\n\n6\n\n\fstantially improve over linear models on these datasets. For our recommendation system experiments,\nwe use the MovieLens 100k and 1M datasets [14]. See Appendix E for complete details.\n\nModel validation. The greedy nature of Algorithm 1 allows us to easily interleave training with\nmodel validation. Concretely, we use an outer loop (embarrassingly parallel) for iterating over the\nrange of possible regularization parameters, and an inner loop (Algorithm 1, sequential) for increasing\nthe number of basis vectors. Throughout our experiments, we use 50% of the data for training, 25%\nfor validation, and 25% for evaluation. Unless otherwise speci\ufb01ed, we use a multi-class logistic loss.\n\n6.2 Method comparison for the basis vector (hidden unit) selection subproblem\n\nAs we mentioned previously, the linearized subprob-\nlem (basis vector selection) for the l1/l2 and l1/l\u221e\nconstrained cases involves a signi\ufb01cantly more chal-\nlenging non-convex optimization problem. In this\nsection, we compare different methods for obtaining\nan approximate solution \u02c6h to (4). We focus on the\n\u21131/\u2113\u221e case, since we have a method for computing\nthe true global solution h\u22c6, albeit with exponential\ncomplexity in m (cf. Appendix C). This allows us\nto report the empirically observed multiplicative\napproximation factor \u02c6\u03bd := f1(\u02c6h)/f1(h\u22c6).\nCompared methods. 
We compare l1 init + re\ufb01ne (proposed), random init + re\ufb01ne, l1 init (without\nre\ufb01ne), random init and best data: \u02c6h = xi\u22c6 /kxi\u22c6k2 where i\u22c6 = argmax\ni\u2208[n]\nResults. We report \u02c6\u03bd in Figure 2. l1 init + re\ufb01ne achieves nearly the global maximum on both\ndatasets and outperforms random init + re\ufb01ne, showing the effectiveness of the proposed initialization\nand that the iterative update (6) can get stuck in a bad local minimum if initialized badly. On the\nother hand, l1 init + re\ufb01ne outperforms l1 init alone, showing the importance of iteratively re\ufb01ning\nthe solution. Best data, a heuristic similar to that of approximate kernel SVMs [7], is not competitive.\n\nFigure 2: Empirically observed multiplicative\napproximation factor \u02c6\u03bd = f1(\u02c6h)/f1(h\u22c6).\n\nf1(xi/kxik2).\n\n6.3 Sparsity-inducing penalty comparison\n\nIn this section, we compare the l1, l1/l2 and l1/l\u221e penalties for the\nchoice of \u2126, when varying the maximum number of basis vectors\n(hidden units). Figure 3 indicates test set accuracy when using\noutput layer re\ufb01tting. We also include linear logistic regression,\nkernel SVMs and the Nystr\u00f6m method as baselines. For the latter\ni xj + 1)2. Hyper-parameters\ntwo, we use the quadratic kernel (xT\nare chosen so as to maximize validation set accuracy.\n\nResults. On the vowel (11 classes) and letter (26 classes) datasets,\nl1/l2 and l1/l\u221e penalties outperform l1 norm starting from 20 and\n75 hidden units, respectively. On satimage (6 classes) and segment\n(7 classes), we observed that the three penalties are mostly similar\n(not shown). We hypothesize that l1/l2 and l1/l\u221e penalties make\na bigger difference when the number of classes is large. Multi-\noutput PNs substantially outperform the Nystr\u00f6m method with\ncomparable number of basis vectors (hidden units). 
Multi-output\nPNs reach the same test accuracy as kernel SVMs with very few\nbasis vectors on vowel and satimage but appear to require at least\n100 basis vectors to reach good performance on letter. This is not\nsurprising, since kernel SVMs require 3,208 support vectors on\nletter, as indicated in Table 2 below.\n\n6.4 Multi-class benchmark comparison\n\nFigure 3: Penalty comparison.\n\nCompared methods. We compare the proposed conditional gradient algorithm with output layer\nre\ufb01tting only and with both output and hidden layer re\ufb01tting; projected gradient descent (FISTA)\n\n7\n\n0.000.250.500.751.00best datarandom initl1 initrandom init+refinel1 init + refine(proposed)satimage0.000.250.500.751.00vowel0.860.880.900.920.94letter050100150Max. hidden units0.500.70Test multi-class accuracy\fTable 2: Muli-class test accuracy and number of basis vectors / support vectors.\n\nsegment\n\nvowel\n\nsatimage\n\nletter\n\nConditional gradient (full re\ufb01tting, proposed)\n\n96.71 (41)\n\nl1\nl1/l2\n96.71 (40)\nl1/l\u221e 96.71 (24)\n\n87.83 (12)\n\n89.80 (25)\n\n92.29 (150)\n\n89.57 (15)\n\n89.08 (18)\n\n91.81 (106)\n\n86.96 (15)\n\n88.99 (20)\n\n92.35 (149)\n\nConditional gradient (output-layer re\ufb01tting, proposed)\n\n97.05 (20)\n\nl1\nl1/l2\n96.36 (21)\nl1/l\u221e 96.19 (16)\n\n80.00 (21)\n\n89.71 (40)\n\n91.01 (139)\n\n85.22 (15)\n\n89.71 (50)\n\n92.24 (150)\n\n86.96 (41)\n\n89.35 (41)\n\n91.68 (128)\n\nProjected gradient descent (random init)\n\n96.88 (50)\n\nl1\nl1/l2\n96.88 (50)\nl1/l\u221e 96.71 (50)\n\n79.13 (50)\n\n89.53 (50)\n\n88.45 (150)\n\n80.00 (48)\n\n89.80 (50)\n\n88.45 (150)\n\n83.48 (50)\n\n89.08 (50)\n\n88.45 (150)\n\nl2\n\n2\n\nBaselines\nLinear\n\n96.88 (50)\n\n81.74 (50)\n\n89.98 (50)\n\n88.45 (150)\n\n92.55\n\n60.00\n\n83.03\n\n71.17\n\nKernelized\n\n96.71 (238)\n\n85.22 (189)\n\n89.53 (688)\n\n93.73 (3208)\n\nOvR PN\n\n94.63\n\n73.91\n\n89.44\n\n75.36\n\nwith random initialization; linear and kernelized models; one-vs-rest PNs (i.e., 
\ufb01t one PN per class).\nWe focus on PNs rather than FMs since they are known to work better on classi\ufb01cation tasks [5].\n\nResults are included in Table 2. From these results, we can make the following observations and\nconclusions. When using output-layer re\ufb01tting on vowel and letter (two datasets with more than 10\nclasses), group-sparsity inducing penalties lead to better test accuracy. This is to be expected, since\nthese penalties select basis vectors that are useful across all classes. When using full hidden layer and\noutput layer re\ufb01tting, l1 catches up with l1/l2 and l1/l\u221e on the vowel and letter datasets. Intuitively,\nthe basis vector selection becomes less important if we make more effort at every iteration by re\ufb01tting\nthe basis vectors themselves. However, on vowel, l1/l2 is still substantially better than l1 (89.57 vs.\n87.83).\n\nCompared to projected gradient descent with random initialization, our algorithm (for both output\nand full re\ufb01tting) is better on 3/4 (l1), 2/4 (l1/l2) and 3/4 (l1/l\u221e) of the datasets. In addition, with our\nalgorithm, the best model (chosen against the validation set) is substantially sparser. Multi-output\nPNs substantially outperform OvR PNs. This is to be expected, since multi-output PNs learn to share\nbasis vectors across different classes.\n\n6.5 Recommender system experiments using ordinal regression\n\nA straightforward way to implement recommender systems consists in training a single-output model\nto regress ratings from one-hot encoded user and item indices [25]. Instead of a single-output PN\nor FM, we propose to use ordinal McRank, a reduction from ordinal regression to multi-output\nbinary classi\ufb01cation, which is known to achieve good nDCG (normalized discounted cumulative\ngain) scores [19]. This reduction involves training a probabilistic binary classi\ufb01er for each of the m\nrelevance levels (for instance, m = 5 in the MovieLens datasets). 
The expected relevance of x (e.g.\nthe concatenation of the one-hot encoded user and item indices) is then computed by\n\n\u02c6y =\n\nm\n\nX\n\nc=1\n\nc p(y = c | x) =\n\nm\n\nX\n\nc=1\n\nchp(y \u2264 c | x) \u2212 p(y \u2264 c \u2212 1 | x)i,\n\nwhere we use the convention p(y \u2264 0 | x) = 0. Thus, all we need to do to use ordinal McRank is to\ntrain a probabilistic binary classi\ufb01er p(y \u2264 c | x) for all c \u2208 [m].\nOur key proposal is to use a multi-output model to learn all m classi\ufb01ers simultaneously, i.e., in a\nmulti-task fashion. Let xi be a vector representing a user-item pair with corresponding rating yi, for\n\n8\n\n\fFigure 4: Recommender system experiment: RMSE (lower is better) and nDCG (higher is better).\n\ni \u2208 [n]. We form a n \u00d7 m matrix Y such that yi,c = +1 if yi \u2264 c and \u22121 otherwise, and solve\n\nmin\n\u2126(U )\u2264\u03c4\n\nn\n\nm\n\nXi=1\n\nXc=1\n\n\u2113 yi,c, Xh\u2208H\n\n\u03c3ANOVA(h, xi)uh,c! ,\n\nwhere \u2113 is set to the binary logistic loss, in order to be able to produce probabilities. After running\nAlgorithm 1 on that objective for k iterations, we obtain H \u2208 Rk\u00d7d and V \u2208 Rk\u00d7m. Because H is\nto a single-output regression model, therefore comes from learning V \u2208 Rk\u00d7m instead of v \u2208 Rk.\n\nshared across all outputs, the only small overhead of using the ordinal McRank reduction, compared\n\nIn this experiment, we focus on multi-output factorization machines (FMs), since FMs usually work\nbetter than PNs for one-hot encoded data [5]. We show in Figure 4 the RMSE and nDCG (truncated\nat 1 and 5) achieved when varying k (the maximum number of basis vectors / hidden units).\nResults. When combined with the ordinal McRank reduction, we found that l1/l2 and l1/l\u221e\u2013\nconstrained multi-output FMs substantially outperform single-output FMs and PNs on both RMSE\nand nDCG measures. 
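The ordinal McRank reduction used above is straightforward to implement. A small sketch (our own illustration, not the authors' code) that builds the binary targets y_{i,c} and turns cumulative probabilities p(y <= c | x) into an expected relevance, using the convention p(y <= 0 | x) = 0:

```python
import numpy as np

def cumulative_targets(y, m):
    """Binary target matrix for ordinal McRank: Y[i, c-1] = +1 if y_i <= c,
    else -1, one column per relevance level c = 1..m."""
    levels = np.arange(1, m + 1)
    return np.where(y[:, None] <= levels[None, :], 1.0, -1.0)

def expected_relevance(P_cum):
    """Expected relevance from cumulative probabilities.
    P_cum[i, c-1] estimates p(y_i <= c | x_i); the last column should be ~1.
    Computes yhat = sum_c c * [p(y <= c) - p(y <= c-1)]."""
    n, m = P_cum.shape
    padded = np.concatenate([np.zeros((n, 1)), P_cum], axis=1)
    point_probs = np.diff(padded, axis=1)       # p(y = c | x)
    return point_probs @ np.arange(1, m + 1)    # sum_c c * p(y = c | x)
```

In the multi-output setting, the m cumulative classifiers share the basis vectors H, so only the output layer grows with m, as noted in the text.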
For instance, on MovieLens 100k and 1M, l1/l\u221e\u2013constrained multi-output\nFMs achieve an nDCG@1 of 0.75 and 0.76, respectively, while single-output FMs only achieve 0.71\nand 0.75. Similar trends are observed with nDCG@5. We believe that this reduction is more robust\nto ranking performance measures such as nDCG thanks to its modelling of the expected relevance.\n\n7 Conclusion and future directions\n\nWe de\ufb01ned the problem of learning multi-output PNs and FMs as that of learning a 3-way tensor\nwhose slices share a common basis. To obtain a convex optimization objective, we reformulated that\nproblem as that of learning an in\ufb01nite but row-wise sparse matrix. To learn that matrix, we developed\na conditional gradient algorithm with corrective re\ufb01tting, and were able to provide convergence\nguarantees, despite the non-convexity of the basis vector (hidden unit) selection step.\n\nAlthough not considered in this paper, our algorithm and its analysis can be modi\ufb01ed to make\nuse of stochastic gradients. An open question remains whether a conditional gradient algorithm\nwith provable guarantees can be developed for training deep polynomial networks or factorization\nmachines. Such deep models could potentially represent high-degree polynomials with few basis\nvectors. However, this would require the introduction of a new functional analysis framework.\n\n9\n\n010203040500.940.960.981.00Movielens 100kRMSE010203040500.680.700.720.740.76nDCG@1010203040500.730.740.750.760.77nDCG@501020304050Max. hidden units0.900.920.940.960.981.00Movielens 1M01020304050Max. hidden units0.720.730.740.750.7601020304050Max. hidden units0.750.760.77Single-output PNSingle-output FMOrdinal McRank FM l1/l2Ordinal McRank FM l1/l\fReferences\n\n[1] F. Bach. Breaking the curse of dimensionality with convex neural networks. JMLR, 2017.\n\n[2] Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In\n\nNIPS, 2005.\n\n[3] D. P. Bertsekas. 
Nonlinear programming. Athena Scienti\ufb01c Belmont, 1999.\n\n[4] M. Blondel, A. Fujino, and N. Ueda. Convex factorization machines. In ECML/PKDD, 2015.\n\n[5] M. Blondel, M. Ishihata, A. Fujino, and N. Ueda. Polynomial networks and factorization\n\nmachines: New insights and ef\ufb01cient training algorithms. In ICML, 2016.\n\n[6] M. Blondel, K. Seki, and K. Uehara. Block coordinate descent algorithms for large-scale sparse\n\nmulticlass classi\ufb01cation. Machine Learning, 93(1):31\u201352, 2013.\n\n[7] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classi\ufb01ers with online and active\n\nlearning. JMLR, 6(Sep):1579\u20131619, 2005.\n\n[8] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear\n\ninverse problems. Foundations of Computational Mathematics, 12(6):805\u2013849, 2012.\n\n[9] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing\nlow-degree polynomial data mappings via linear svm. Journal of Machine Learning Research,\n11:1471\u20131490, 2010.\n\n[10] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In\n\nEMNLP, 2014.\n\n[11] J. C. Dunn and S. A. Harshbarger. Conditional gradient algorithms with open loop step size\n\nrules. Journal of Mathematical Analysis and Applications, 62(2):432\u2013444, 1978.\n\n[12] R.-E. Fan and C.-J. Lin.\n\nhttp://www.csie.ntu.edu.tw/~cjlin/libsvmtools/\n\ndatasets/, 2011.\n\n[13] A. Gautier, Q. N. Nguyen, and M. Hein. Globally optimal training of generalized polynomial\n\nneural networks with nonlinear spectral methods. In NIPS, 2016.\n\n[14] GroupLens. http://grouplens.org/datasets/movielens/, 1998.\n\n[15] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.\n\n[16] M. Journ\u00e9e, Y. Nesterov, P. Richt\u00e1rik, and R. Sepulchre. Generalized power method for sparse\n\nprincipal component analysis. 
Journal of Machine Learning Research, 11:517\u2013553, 2010.\n\n[17] Y. Juan, Y. Zhuang, W.-S. Chin, and C.-J. Lin. Field-aware factorization machines for CTR\n\nprediction. In ACM Recsys, 2016.\n\n[18] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe\n\noptimization for structural SVMs. In ICML, 2012.\n\n[19] P. Li, C. J. Burges, and Q. Wu. McRank: Learning to rank using multiple classi\ufb01cation and\n\ngradient boosting. In NIPS, 2007.\n\n[20] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational ef\ufb01ciency of training neural\n\nnetworks. In NIPS, 2014.\n\n[21] R. Luss and M. Teboulle. Conditional gradient algorithms for rank-one matrix approximations\n\nwith a sparsity constraint. SIAM Review, 55(1):65\u201398, 2013.\n\n[22] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transac-\n\ntions on Signal Processing, 41(12):3397\u20133415, 1993.\n\n[23] A. Novikov, M. Tro\ufb01mov, and I. Oseledets. Exponential machines.\n\narXiv preprint\n\narXiv:1605.03795, 2016.\n\n10\n\n\f[24] A. Podosinnikova, F. Bach, and S. Lacoste-Julien. Beyond CCA: Moment matching for multi-\n\nview models. In ICML, 2016.\n\n[25] S. Rendle. Factorization machines. In ICDM, 2010.\n\n[26] S. Shalev-Shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank\n\nconstraint. In ICML, 2011.\n\n[27] S. Shalev-Shwartz, Y. Wexler, and A. Shashua. ShareBoost: Ef\ufb01cient multiclass learning with\n\nfeature sharing. In NIPS, 2011.\n\n[28] S. Sonnenburg and V. Franc. Cof\ufb01n: A computational framework for linear SVMs. In ICML,\n\n2010.\n\n[29] E. Stoudenmire and D. J. Schwab. Supervised learning with tensor networks. In NIPS, 2016.\n\n[30] Z. Wang, K. Crammer, and S. Vucetic. Multi-class Pegasos on a budget. In ICML, 2010.\n\n[31] M. Yamada, W. Lian, A. Goyal, J. Chen, K. Wimalawarne, S. A. Khan, S. Kaski, H. M.\nMamitsuka, and Y. Chang. 
Convex factorization machine for toxicogenomics prediction. In\nKDD, 2017.\n\n[32] E. Zhong, Y. Shi, N. Liu, and S. Rajan. Scaling factorization machines with parameter server.\n\nIn CIKM, 2016.\n\n11\n\n\f", "award": [], "sourceid": 1906, "authors": [{"given_name": "Mathieu", "family_name": "Blondel", "institution": "NTT"}, {"given_name": "Vlad", "family_name": "Niculae", "institution": "Cornell University"}, {"given_name": "Takuma", "family_name": "Otsuka", "institution": "NTT Communication Science Labs"}, {"given_name": "Naonori", "family_name": "Ueda", "institution": "NTT Communication Science Laboratories/ RIKEN AIP"}]}