{"title": "Dual Decomposed Learning with Factorwise Oracle for Structural SVM of Large Output Domain", "book": "Advances in Neural Information Processing Systems", "page_first": 5030, "page_last": 5038, "abstract": "Many applications of machine learning involve structured output with large domain, where learning of structured predictor is prohibitive due to repetitive calls to expensive inference oracle. In this work, we show that, by decomposing training of Structural Support Vector Machine (SVM) into a series of multiclass SVM problems connected through messages, one can replace expensive structured oracle with Factorwise Maximization Oracle (FMO) that allows efficient implementation of complexity sublinear to the factor domain. A Greedy Direction Method of Multiplier (GDMM) algorithm is proposed to exploit sparsity of messages which guarantees $\\epsilon$ sub-optimality after $O(log(1/\\epsilon))$ passes of FMO calls. We conduct experiments on chain-structured problems and fully-connected problems of large output domains. The proposed approach is orders-of-magnitude faster than the state-of-the-art training algorithms for Structural SVM.", "full_text": "Dual Decomposed Learning with Factorwise Oracles\n\nfor Structural SVMs of Large Output Domain\n\nIan E.H. Yen \u2020 Xiangru Huang \u2021 Kai Zhong \u2021 Ruohan Zhang \u2021\n\nInderjit S. Dhillon \u2021\n\u2021 University of Texas at Austin\n\nPradeep Ravikumar \u2020\n\n\u2020 Carnegie Mellon University\n\nAbstract\n\nMany applications of machine learning involve structured outputs with large do-\nmains, where learning of a structured predictor is prohibitive due to repetitive\ncalls to an expensive inference oracle. 
In this work, we show that, by decomposing the training of a Structural Support Vector Machine (SVM) into a series of multiclass SVM problems connected through messages, one can replace the expensive structured oracle with Factorwise Maximization Oracles (FMOs) that allow an efficient implementation whose complexity is sublinear in the size of the factor domain. A Greedy Direction Method of Multiplier (GDMM) algorithm is then proposed to exploit the sparsity of messages while guaranteeing convergence to ε sub-optimality after O(log(1/ε)) passes of FMOs over every factor. We conduct experiments on chain-structured and fully-connected problems with large output domains, where the proposed approach is orders of magnitude faster than current state-of-the-art algorithms for training Structural SVMs.

1 Introduction

Structured prediction has become prevalent, with wide applications in Natural Language Processing (NLP), Computer Vision, and Bioinformatics, to name a few, where one is interested in outputs with strong interdependence. Although many dependency structures yield intractable inference problems, approximation techniques such as convex relaxations with theoretical guarantees [10, 14, 7] have been developed. However, solving the relaxed problems (LP, QP, SDP, etc.) is computationally expensive for factor graphs with large output domains, and results in prohibitive training time when embedded into a learning algorithm that relies on inference oracles [9, 6]. For instance, many applications in NLP, such as Machine Translation [3], Speech Recognition [21], and Semantic Parsing [1], have output domains as large as the vocabulary, for which the prediction of even a single sentence takes considerable time.

One approach to avoiding inference during training is to introduce a loss function conditioned on the given labels of neighboring output variables [15]. 
However, this also introduces more variance into the model estimate and can degrade test performance significantly. Another thread of research aims to formulate parameter learning and output inference as a joint optimization problem that avoids treating inference as a subroutine [12, 11]. In this approach, the structured hinge loss is reformulated via dual decomposition, so that both the messages between factors and the model parameters are treated as first-class variables. The new formulation, however, does not yield a computational advantage on its own, due to the constraints entangling the two types of variables. In particular, [11] employs a hybrid method (DLPW) that alternately optimizes model parameters and messages, but the algorithm is not significantly faster than performing stochastic gradient descent directly on the structured hinge loss.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: (left) Factors with large output domains in Sequence Labeling. (right) Large number of factors in a Correlated Multilabel Prediction problem. Circles denote variables and black boxes denote factors. (Yu: domain of a unigram factor; Yb: domain of a bigram factor.)

More recently, [12] proposes an approximate objective for Structural SVMs that leads to an algorithm considerably faster than DLPW on problems requiring expensive inference. However, the approximate objective requires a trade-off between efficiency and approximation quality, yielding an O(1/ε²) overall iteration complexity for achieving ε sub-optimality.

The contribution of this work is twofold. First, we propose a Greedy Direction Method of Multiplier (GDMM) algorithm that decomposes the training of a Structural SVM into factorwise multiclass SVMs connected through sparse messages confined to the active labels. 
The algorithm guarantees an O(log(1/ε)) iteration complexity for achieving ε sub-optimality, and each iteration requires only one pass of Factorwise Maximization Oracles (FMOs) over every factor. Second, we show that the FMO can be realized in time sublinear in the cardinality of the factor domain, and hence is considerably more efficient than a structural maximization oracle for large output domains. For problems consisting of numerous binary variables, we further give a realization of a joint FMO with complexity sublinear in the number of factors. We conduct experiments on both chain-structured problems that allow exact inference and fully-connected problems that rely on Linear Program relaxations, where we show that the proposed approach is orders of magnitude faster than current state-of-the-art training algorithms for Structural SVMs.

2 Problem Formulation

Structured prediction aims to predict a set of outputs y ∈ Y(x) from their interdependency and inputs x ∈ X. Given a feature map φ(x, y) : X × Y(x) → R^d that extracts relevant information from (x, y), a linear classifier with parameters w can be defined as h(x; w) = argmax_{y∈Y(x)} ⟨w, φ(x, y)⟩, where we estimate the parameters w from a training set D = {(x_i, ȳ_i)}_{i=1}^n by solving a regularized Empirical Risk Minimization (ERM) problem:

min_w  (1/2)‖w‖² + C Σ_{i=1}^n L(w; x_i, ȳ_i).    (1)

In the case of a Structural SVM [19, 20], we consider the structured hinge loss

L(w; x, ȳ) = max_{y∈Y(x)} ⟨w, φ(x, y) − φ(x, ȳ)⟩ + δ(y, ȳ),    (2)

where δ(y, ȳ) is a task-dependent error function, for which the Hamming distance δ_H(y, ȳ) is commonly used. 
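To make the loss (2) concrete, the following toy sketch evaluates the structured hinge loss by brute-force enumeration of Y(x). This is feasible only at toy scale; the scorer, feature weights, and label set here are invented for illustration and are not the paper's setup.

```python
import itertools

def hamming(y, y_bar):
    # Task loss δ(y, ȳ): number of mismatched output variables.
    return sum(a != b for a, b in zip(y, y_bar))

def structured_hinge(score, labels, n_vars, y_bar):
    # Loss (2) by exhaustive enumeration of Y(x):
    #   max_y  score(y) - score(ȳ) + δ(y, ȳ)
    base = score(y_bar)
    return max(score(y) - base + hamming(y, y_bar)
               for y in itertools.product(labels, repeat=n_vars))

# Hypothetical scorer ⟨w, φ(x, y)⟩ with unigram indicator features only.
w = {(0, "A"): 2.0, (0, "B"): 0.0, (1, "A"): 0.0, (1, "B"): 1.0}
score = lambda y: sum(w[(j, y_j)] for j, y_j in enumerate(y))

loss = structured_hinge(score, ["A", "B"], 2, ("A", "B"))  # 0.0: ȳ wins by a margin
```

The exponential enumeration over Y(x) is exactly what the factor decomposition of the next paragraphs is designed to avoid.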
Since the size of the domain |Y(x)| typically grows exponentially with the number of output variables, the tractability of problem (1) lies in the decomposition of the response ⟨w, φ(x, y)⟩ into several factors, each involving only a few outputs. The factor decomposition can be represented as a bipartite graph G(F, V, E) between factors F and variables V, where an edge (f, j) ∈ E exists if the factor f involves the variable j. Typically, a set of factor templates T exists so that factors of the same template F ∈ T share the same feature map φ_F(·) and parameter vector w_F. The response on an input-output pair (x, y) is then given by

⟨w, φ(x, y)⟩ = Σ_{F∈T} Σ_{f∈F(x)} ⟨w_F, φ_F(x_f, y_f)⟩,    (3)

where F(x) denotes the set of factors on x that share a template F, and y_f denotes the output variables relevant to factor f, of domain Y_f = Y_F. We will use F(x) to denote the union of factors of different templates {F(x)}_{F∈T}. Figure 1 shows two examples, both of which have two factor templates (i.e. unigram and bigram), for which the response decomposes as Σ_{f∈u(x)} ⟨w_u, φ_u(x_f, y_f)⟩ + Σ_{f∈b(x)} ⟨w_b, φ_b(y_f)⟩. Unfortunately, even with such a decomposition, the maximization in (2) is still computationally expensive. First, most graph structures do not allow exact maximization, so in practice one minimizes an upper bound of the original loss (2) obtained from a relaxation [10, 18]. Second, even for the relaxed loss, or for a tree-structured graph that allows polynomial-time maximization, the complexity is at least linear in the cardinality of the factor domain |Y_f| times the number of factors |F|. This results in a prohibitive computational cost for problems with large output domains. 
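As a small sketch of the factor decomposition, the response of a labeling can be accumulated template by template. The weight tables and factor scopes below are invented for illustration, and the dependence of φ_F on x_f is dropped for brevity.

```python
def factored_score(w_templates, factors, y):
    # Response ⟨w, φ(x, y)⟩ as a sum over factors; factors of the same
    # template share one weight table.
    total = 0.0
    for template, scope in factors:
        y_f = tuple(y[j] for j in scope)  # restriction of y to the factor's variables
        total += w_templates[template].get(y_f, 0.0)
    return total

# Chain of length 3 with unigram and bigram templates (cf. Figure 1, left).
w_templates = {
    "unigram": {("A",): 1.0, ("B",): 0.5},
    "bigram":  {("A", "B"): 2.0},
}
factors = [("unigram", (0,)), ("unigram", (1,)), ("unigram", (2,)),
           ("bigram", (0, 1)), ("bigram", (1, 2))]

s = factored_score(w_templates, factors, ("A", "B", "A"))  # 1.0 + 0.5 + 1.0 + 2.0 = 4.5
```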
As in Figure 1, one example has a factor domain |Yb| that grows quadratically with the size of the output domain; the other has a number of factors |F| that grows quadratically with the number of outputs. A key observation of this paper is that, in contrast to the structural maximization (2), which requires a larger extent of exploration over locally suboptimal assignments in order to achieve global optimality, the Factorwise Maximization Oracle (FMO)

y*_f := argmax_{y_f} ⟨w_F, φ(x_f, y_f)⟩    (4)

can be realized more efficiently by maintaining data structures on the factor parameters w_F. In the next section, we develop globally convergent algorithms that rely only on FMOs, and provide realizations of a message-augmented FMO with cost sublinear in the size of the factor domain or in the number of factors.

3 Dual-Decomposed Learning

We consider an upper bound of the loss (2) based on a Linear Program (LP) relaxation that is tight in the case of a tree-structured graph and leads to a tractable approximation for general factor graphs [11, 18]:

L_LP(w; x, ȳ) = max_{(q,p)∈M_L} Σ_{f∈F(x)} ⟨θ_f(w), q_f⟩,    (5)

where θ_f(w) := ( ⟨w_F, φ_F(x_f, y_f) − φ_F(x_f, ȳ_f)⟩ + δ_f(y_f, ȳ_f) )_{y_f∈Y_f}. M_L is a polytope that constrains each q_f to the |Y_f|-dimensional simplex Δ^{|Y_f|} and also enforces local consistency:

M_L := { q = (q_f)_{f∈F(x)}, p = (p_j)_{j∈V(x)}  |  q_f ∈ Δ^{|Y_f|}, ∀f ∈ F(x), ∀F ∈ T;  M_jf q_f = p_j, ∀(j, f) ∈ E(x) },

where M_jf is a |Y_j|-by-|Y_f| matrix with M_jf(y_j, y_f) = 1 if y_j is consistent with y_f (i.e. y_j = [y_f]_j) and M_jf(y_j, y_f) = 0 otherwise. For a tree-structured graph G(F, V, E), the LP relaxation is tight and thus the loss (5) is equivalent to (2). 
For a general factor graph, (5) is an upper bound on the original loss (2). It has been observed that parameters w learned from the upper bound (5) tend to tighten the LP relaxation, and thus in practice lead to a tight LP in the testing phase [10]. Instead of solving the LP (5) as a subroutine, a recent attempt formulates (1) as a problem that optimizes (p, q) and w jointly via dual decomposition [11, 12]. Denote by λ_jf the dual variables associated with the constraint M_jf q_f = p_j, and let λ_f := (λ_jf)_{j∈N(f)}, where N(f) = {j | (j, f) ∈ E}. We have

L_LP(w; x, ȳ) = max_{q,p} min_λ Σ_{f∈F(x)} ⟨θ_f(w), q_f⟩ + Σ_{f∈F(x)} Σ_{j∈N(f)} ⟨λ_jf, M_jf q_f − p_j⟩    (6)

= min_{λ∈Λ} Σ_{f∈F(x)} max_{q_f∈Δ^{|Y_f|}} ( θ_f(w) + Σ_{j∈N(f)} M_jf^T λ_jf )^T q_f    (7)

= min_{λ∈Λ} Σ_{f∈F(x)} max_{y_f∈Y_f} ( θ_f(y_f; w) + Σ_{j∈N(f)} λ_jf([y_f]_j) ) =: min_{λ∈Λ} Σ_{f∈F(x)} L_f(w; x_f, ȳ_f, λ_f),    (8)

where (7) follows from strong duality, and the domain Λ = { λ | Σ_{f:(j,f)∈E(x)} λ_jf = 0, ∀j ∈ V(x) } follows from the maximization w.r.t. p in (6). The result (8) is a loss function L_f(·) that penalizes the response of each factor separately, given λ_f. 
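A minimal sketch of the per-factor loss in (8): the factor maximizes its own augmented score, with each message λ_jf added to the scores of the locally consistent states. All numbers below are invented for illustration.

```python
def factor_loss(theta_f, lambdas, Yf):
    # L_f of (8): max over the factor's own domain of the factor score
    # plus the incoming messages evaluated at the consistent local states.
    def augmented(y_f):
        return theta_f[y_f] + sum(lam[y_f[j]] for j, lam in enumerate(lambdas))
    return max(augmented(y_f) for y_f in Yf)

# A bigram factor over two binary variables, with one message per neighbor.
Yf = [(0, 0), (0, 1), (1, 0), (1, 1)]
theta_f = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): -0.5, (1, 1): 0.2}
lambdas = [{0: 0.3, 1: -0.3}, {0: -0.1, 1: 0.1}]

Lf = factor_loss(theta_f, lambdas, Yf)  # (0, 1) wins: 1.0 + 0.3 + 0.1 = 1.4
```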
The ERM problem (1) can then be expressed as

min_{w, λ∈Λ}  Σ_{F∈T} ( (1/2)‖w_F‖² + C Σ_{f∈F} L_f(w_F; x_f, ȳ_f, λ_f) ),    (9)

where F = ∪_{i=1}^N F(x_i) and F = ∪_{F∈T} F. The formulation (9) has an insightful interpretation: each factor template F learns a multiclass SVM given by parameters w_F from the factors f ∈ F, while each factor is augmented with messages λ_f passed from all variables related to f.

Despite this insightful interpretation, formulation (9) does not directly yield a computational advantage. In particular, the non-smooth loss L_f(·) entangles the parameters w and the messages λ, which leads to a difficult optimization problem. Previous attempts to solve (9) either have slow convergence [11] or rely on an approximate objective [12]. The dual problem of (9), on which our algorithm operates, can be expressed as¹

min_{α_f ∈ Δ̄^{|Y_f|}}  G(α) := (1/2) Σ_{F∈T} ‖w_F(α)‖² − Σ_{j∈V} δ_j^T α_j
s.t.  M_jf α_f = α_j, j ∈ N(f), f ∈ F,  where  w_F(α) = Σ_{f∈F} Φ_f^T α_f,    (10)

and each α_f lies in the shifted simplex

Δ̄^{|Y_f|} := { α_f  |  α_f(ȳ_f) ≤ C;  α_f(y_f) ≤ 0, ∀y_f ≠ ȳ_f;  Σ_{y_f∈Y_f} α_f(y_f) = 0 }.    (11)

Algorithm 1  Greedy Direction Method of Multiplier
  0. Initialize t = 0, α⁰ = 0, λ⁰ = 0 and A⁰ = A_init.
  for t = 0, 1, ... do
    1. Compute (α^{t+1}, A^{t+1}) via one pass of Algorithm 2, 3, or 4.
    2. λ_jf^{t+1} = λ_jf^t + η ( M_jf α_f^{t+1} − α_j^{t+1} ), j ∈ N(f), ∀f ∈ F.
  end for

¹ The α_j are also dual variables for the responses on unigram factors. We define U := V and α_f := α_j, ∀f ∈ U.
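Algorithm 1's outer structure, approximately minimizing an augmented Lagrangian block-wise and then taking an ascent step on the multipliers of the coupling constraints, is the classical ADMM pattern. Here is a toy sketch of that pattern on a two-block consensus problem; the quadratics are invented and this is not the paper's actual subproblem.

```python
def admm_consensus(rho=1.0, iters=200):
    # min (a-1)^2 + (b-3)^2  s.t.  a = b.  Each outer iteration minimizes the
    # augmented Lagrangian exactly in each block (the role Algorithms 2/3 play
    # approximately), then updates the multiplier of the coupling constraint.
    a = b = lam = 0.0
    for _ in range(iters):
        a = (2.0 + rho * b - lam) / (2.0 + rho)   # block minimization in a
        b = (6.0 + rho * a + lam) / (2.0 + rho)   # block minimization in b
        lam += rho * (a - b)                      # dual ascent on the multiplier
    return a, b, lam

a, b, lam = admm_consensus()  # converges to a = b = 2, lam = -2
```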
In the next section, we propose a Greedy Direction Method of Multiplier (GDMM) algorithm for solving (9), which achieves ε sub-optimality in O(log(1/ε)) iterations while requiring only one pass of FMOs per iteration.

3.1 Greedy Direction Method of Multiplier

Let α_f(y_f) denote the dual variables for the factor responses z_f(y_f) = ⟨w, φ(x_f, y_f)⟩, and {α_j}_{j∈V} those for the constraints in Λ; the dual problem of (9) then takes the form (10) with domain (11). Problem (10) can be interpreted as a sum of the dual objectives of |T| multiclass SVMs (one per factor template), connected by consistency constraints. To minimize (10) one factor at a time, we adopt a Greedy Direction Method of Multiplier (GDMM) algorithm, outlined in Algorithm 1, which alternates between minimizing the Augmented Lagrangian function

min_{α_f∈Δ̄^{|Y_f|}}  L(α, λ^t) := G(α) + (ρ/2) Σ_{j∈N(f), f∈F} ( ‖m_jf(α, λ^t)‖² − ‖λ_jf^t‖² )    (12)

and updating the Lagrange multipliers (of the consistency constraints)

λ_jf^{t+1} = λ_jf^t + η ( M_jf α_f − α_j ), ∀j ∈ N(f), f ∈ F,    (13)

where m_jf(α, λ^t) = M_jf α_f − α_j + λ_jf^t plays the role of messages between the |T| multiclass problems, and η is a constant step size. The minimization (12) is conducted in an approximate and greedy fashion, with the aim of involving as few dual variables as possible. We discuss two greedy algorithms that suit two different cases in the following.

Factor of Large Domain.  For problems with large factor domains, we minimize (12) via a variant of the Frank-Wolfe algorithm with away steps (AFW) [8], outlined in Algorithm 2. 
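The away-step pattern of Algorithm 2 can be sketched on a toy quadratic over the probability simplex; exact line search is possible because the objective is quadratic, as in (12). This is an illustrative sketch, not the paper's factorwise subproblem over the shifted simplex.

```python
import numpy as np

def afw_simplex(c, iters=500):
    # Away-step Frank-Wolfe for min_{α in simplex} 0.5 * ||α - c||^2:
    # compare the greedy FW direction with an away direction over the active
    # vertices and take an exact line-search step along the better one.
    n = len(c)
    alpha = np.ones(n) / n
    for _ in range(iters):
        g = alpha - c                               # gradient
        s = np.zeros(n)
        s[np.argmin(g)] = 1.0                       # greedy (FW) vertex
        active = np.flatnonzero(alpha > 1e-12)
        v = active[np.argmax(g[active])]            # away vertex
        d_fw = s - alpha
        d_aw = alpha.copy()
        d_aw[v] -= 1.0                              # alpha - e_v
        if g @ d_fw <= g @ d_aw:
            d, g_max = d_fw, 1.0
        else:                                       # away step: bounded step size
            d, g_max = d_aw, alpha[v] / (1.0 - alpha[v] + 1e-12)
        denom = d @ d
        if denom < 1e-18:
            break                                   # at a vertex / converged
        alpha = alpha + np.clip(-(g @ d) / denom, 0.0, g_max) * d
    return alpha

alpha = afw_simplex(np.array([0.7, 0.2, 0.1]))  # optimum is c itself (c lies in the simplex)
```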
The AFW algorithm maintains the iterate α^t as a linear combination of bases constructed during the iterates:

α^t = Σ_{v∈A_t} c_v^t v,   A_t := { v | c_v^t ≠ 0 },    (14)

where A_t maintains an active set of bases with non-zero coefficients. Each iteration of AFW finds a direction v⁺ := (v_f⁺)_{f∈F} leading to the largest descent according to the current gradient, subject to the simplex constraints:

v_f⁺ := argmin_{v_f∈Δ̄^{|Y_f|}} ⟨∇_{α_f} L(α^t, λ^t), v_f⟩ = C ( e_{ȳ_f} − e_{y_f*} ), ∀f ∈ F,    (15)

where y_f* := argmax_{y_f∈Y_f \ {ȳ_f}} ⟨∇_{α_f} L(α^t, λ^t), e_{y_f}⟩ is the non-ground-truth labeling of factor f with highest response. In addition, AFW finds the away direction

v⁻ := argmax_{v∈A_t} ⟨∇_α L(α^t, λ^t), v⟩,    (16)

which corresponds to the basis that yields the largest descent when removed. The update is then determined by

α^{t+1} := α^t + γ_F d_F  if ⟨∇_α L, d_F⟩ < ⟨∇_α L, d_A⟩,  and  α^{t+1} := α^t + γ_A d_A  otherwise,    (17)

where we choose between the two descent directions d_F := v⁺ − α^t and d_A := α^t − v⁻. The step size for each direction, γ_F := argmin_{γ∈[0,1]} L(α^t + γ d_F) and γ_A := argmin_{γ∈[0, c_{v⁻}]} L(α^t + γ d_A), can be computed exactly owing to the quadratic nature of (12). A step is called a drop step if the step size γ* = c_{v⁻} is chosen, which removes the basis v⁻ from the active set; the total number of drop steps can therefore be bounded by half the number of iterations t. Since a drop step could lead to insufficient descent, Algorithm 2 stops only after a non-drop step is performed. Note that Algorithm 2 requires only a factorwise greedy search (15) instead of a structural maximization (2). In Section 3.2 we show how the factorwise search can be implemented much more efficiently than a structural one. All the other steps (2-5) in Algorithm 2 can be computed in O(|A_f| nnz(φ_f)) time, where |A_f| is the number of active states in factor f, which can be much smaller than |Y_f| when the output domain is large.

Algorithm 2  Away-step Frank-Wolfe (AFW)
  repeat
    1. Find a greedy direction v⁺ satisfying (15).
    2. Find an away direction v⁻ satisfying (16).
    3. Compute α^{t+1} according to (17).
    4. Maintain the active set A_t by (14).
    5. Maintain w_F(α) according to (10).
  until a non-drop step is performed.

Algorithm 3  Block-Greedy Coordinate Descent
  for i ∈ [n] do
    1. Find f* satisfying (18) for the i-th sample.
    2. A_i^{s+1} = A_i^s ∪ {f*}.
    for f ∈ A_i do
      3.1 Update α_f according to (19).
      3.2 Maintain w_F(α) according to (10).
    end for
  end for

In practice, a Block-Coordinate Frank-Wolfe (BCFW) method converges much faster than the Frank-Wolfe method (Algorithm 2) [13, 9], but proving linear convergence for BCFW is also much more difficult [13], which prohibits its use in our analysis. In our implementation, however, we adopt the BCFW version, since it turns out to be much more efficient. We include a detailed description of the BCFW version in Appendix A (Algorithm 4).

Large Number of Factors.  Many structured prediction problems, such as alignment, segmentation, and multilabel prediction (Fig. 
1, right), comprise binary variables and a large number of factors with small domains, for which Algorithm 2 does not yield any computational advantage. For this type of problem, we minimize (12) via one pass of Block-Greedy Coordinate Descent (BGCD, Algorithm 3) instead. Let Q_max be an upper bound on the eigenvalues of the block Hessians ∇²_{α_f} L(α). For binary variables with pairwise factors, we have Q_max = 4(max_{f∈F} ‖φ_f‖² + 1). Each iteration of BGCD finds, for each instance x_i, a factor that leads to the most progress,

f* := argmin_{f∈F(x_i)} ( min_{α_f+d ∈ Δ̄^{|Y_f|}} ⟨∇_{α_f} L(α^t, λ^t), d⟩ + (Q_max/2) ‖d‖² ),    (18)

adds it to the set of active factors A_i, and performs updates by solving the block subproblem

d_f* = argmin_{α_f+d ∈ Δ̄^{|Y_f|}} ⟨∇_{α_f} L(α^t, λ^t), d⟩ + (Q_max/2) ‖d‖²    (19)

for each factor f ∈ A_i. Note that |A_i| is bounded by the number of GDMM iterations, and in practice it converges to a constant much smaller than |F(x_i)|. We address in the next section how a joint FMO can be performed to compute (18) in time sublinear in |F(x_i)| in the binary-variable case.

3.2 Greedy Search via Factorwise Maximization Oracle (FMO)

The main difference between the FMO and the structural maximization oracle (2) is that the former involves only simple operations, such as inner products or table look-ups, for which one can easily come up with data structures or approximation schemes to lower the complexity. In this section, we present two approaches to realizing sublinear-time FMOs for two types of factors widely used in practice. 
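The block step (19) above is a prox step under the quadratic upper bound, and (18) simply ranks blocks by the progress that step would make. Here is a sketch using the standard probability simplex in place of the paper's shifted simplex Δ̄ (a simplifying assumption); the gradient values are invented.

```python
import numpy as np

def proj_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based).
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.flatnonzero(u - css / np.arange(1, len(v) + 1) > 0)[-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def greedy_block(alpha, grads, Q):
    # For every block f, evaluate the bound-minimizing step
    # d = proj(α_f - g_f/Q) - α_f and its predicted progress
    # ⟨g_f, d⟩ + (Q/2)||d||^2; return the most promising block.
    best = (None, None, 0.0)
    for f, (a, g) in enumerate(zip(alpha, grads)):
        d = proj_simplex(a - g / Q) - a
        progress = g @ d + 0.5 * Q * (d @ d)
        if progress < best[2]:
            best = (f, d, progress)
    return best[0], best[1]

alpha = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
grads = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
f_star, d_star = greedy_block(alpha, grads, Q=2.0)  # block 1, step (-0.5, 0.5)
```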
We will describe them in terms of pairwise factors, but the approach generalizes naturally to factors involving more variables.

Indicator Factor.  Factors θ_f(x_f, y_f) of the form

⟨w_F, φ_F(x_f, y_f)⟩ = v(x_f, y_f)    (20)

are widely used in practice. This form subsumes the bigram factor v(y_i, y_j) that is prevalent in sequence, grid, and network labeling problems, and also factors that map an input-output pair (x, y) directly to a score v(x, y). For this type of factor, one can maintain ordered multimaps for each factor template F that support ordered visits of {v(x, (y_i, y_j))}_{(y_i,y_j)∈Y_f}, {v(x, (y_i, y_j))}_{y_j∈Y_j}, and {v(x, (y_i, y_j))}_{y_i∈Y_i}. Then, to find the y_f that maximizes (26), we compare the maximizers in 4 cases: (i) (y_i, y_j) : m_if(y_i) = m_jf(y_j) = 0; (ii) (y_i, y_j) : m_if(y_i) = 0; (iii) (y_i, y_j) : m_jf(y_j) = 0; (iv) (y_i, y_j) : m_jf(y_j) ≠ 0, m_if(y_i) ≠ 0. The maximization requires O(|A_i||A_j|) in cases (ii)-(iv) and O(max(|A_i||Y_j|, |Y_i||A_j|)) in case (i) (see details in Appendix C-1). In practice, however, we observe an O(1) cost for case (i), and the bottleneck is actually case (iv), which requires O(|A_i||A_j|). Note that the ordered multimaps need maintenance whenever the vector w_F(α) changes. Fortunately, since the indicator factor has v(y_f, x) = Σ_{f∈F, x_f=x} α_f(y_f), each update (25) changes at most |A_f| elements, which gives a maintenance cost bounded by O(|A_f| log(|Y_F|)). On the other hand, the space complexity is bounded by O(|Y_F||X_F|), since the map is shared among factors.

Binary-Variable Interaction Factor.  Many problems involve pairwise-interaction factors between binary variables, where the factor domain is small but the number of factors is large. For this type of problem, there is typically a rare outcome y_f^A ∈ Y_F. 
We call factors exhibiting this outcome active factors; the score of a labeling is determined by the scores of the active factors (inactive factors contribute score 0). For example, in the problem of multilabel prediction with pairwise interactions (Fig. 1, right), an active unigram factor has outcome y_j^A = 1 and an active bigram factor has y_f^A = (1, 1), and each sample typically has only a few outputs with value 1.

For this type of problem, we show that the gradient magnitude w.r.t. α_f for a bigram factor f is determined by the gradient w.r.t. α_f(y_f^A) whenever one of its incoming messages m_jf, m_if is 0 (see details in Appendix C-2). Therefore, we can find the greedy factor (18) by maintaining an ordered multimap over the scores of the outcome y_f^A in each factor, {v(y_f^A, x_f)}_{f∈F}. The resulting complexity of finding a factor that maximizes (18) is then reduced from O(|Y_i||Y_j|) to O(|A_i||A_j|), where the latter accounts for the comparison among factors whose messages m_if and m_jf are both non-zero.

Inner-Product Factor.  We consider another widely used type of factor, of the form

θ_f(x_f, y_f) = ⟨w_F, φ_F(x_f, y_f)⟩ = ⟨w_F(y_f), φ_F(x_f)⟩,

where all labels y_f ∈ Y_f share the same feature map φ_F(x_f) but have different parameters w_F(y_f). We propose a simple sampling approximation with a performance guarantee for the convergence of GDMM. Note that although one can apply a similar sampling scheme to the structural maximization oracle (2), it is hard to guarantee the quality of the approximation there. 
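The active-factor search above (binary-variable case) hinges on keeping the cached scores v(y^A_f, x_f) in an ordered structure, so that factors with two zero incoming messages never need to be scanned individually. A much-simplified sketch follows; the heap contents, factor names, and the handling of the nonzero-message candidates are illustrative, not the paper's exact bookkeeping.

```python
import heapq

def greedy_active_factor(scores, nonzero_msg_factors):
    # Rank factors whose incoming messages are both zero purely by the cached
    # score of their active outcome y^A_f, via a max-heap (negated scores);
    # factors with a nonzero message are returned separately for explicit
    # comparison, mirroring the O(|A_i||A_j|) candidate set.
    heap = [(-s, f) for f, s in scores.items() if f not in nonzero_msg_factors]
    heapq.heapify(heap)
    top_score, top_factor = heap[0]
    return top_factor, -top_score, sorted(nonzero_msg_factors)

scores = {"f12": 0.3, "f13": 1.7, "f23": -0.4}
factor, score, to_check = greedy_active_factor(scores, {"f23"})  # best zero-message factor: f13
```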
The sampling method divides Y_f into ν mutually exclusive subsets Y_f = ∪_{k=1}^ν Y_f^(k), and realizes an approximate FMO by first sampling k uniformly from [ν] and returning

ŷ_f ∈ argmax_{y_f ∈ Y_f^(k)} ⟨w_F(y_f), φ_F(x_f)⟩.    (21)

Note that there is at least a 1/ν probability that ŷ_f ∈ argmax_{y_f∈Y_f} ⟨w_F(y_f), φ_F(x_f)⟩, since at least one partition Y_f^(k) contains a label of the highest score. In Section 3.3, we show that this approximate FMO still ensures convergence, at a rate scaled by 1/ν. In practice, since the set of active labels does not change frequently during training, once an active label y_f is sampled, it is kept in the active set A_f until the end of the algorithm, which results in a convergence rate similar to that of an exact FMO. Note that for problems with binary variables and a large number of inner-product factors, the sampling technique applies similarly: simply partition the factors as F_i = ∪_{k=1}^ν F_i^(k) and search for active factors only within one randomly chosen partition at a time.

3.3 Convergence Analysis

We show the iteration complexity of the GDMM algorithm with the 1/ν-approximate FMO given in Section 3.2. The convergence guarantee for exact FMOs is obtained by setting ν = 1. The analysis leverages the recent analysis of the global linear convergence of Frank-Wolfe variants [8] for functions of the form (12) with a polyhedral domain, as well as the analysis in [5] for Augmented-Lagrangian-based methods. 
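The sampled FMO (21) itself is a one-liner: draw one of the ν partitions uniformly and maximize only inside it. The partition contents and scores below are invented for illustration.

```python
import random

def sampled_fmo(scores_by_label, partitions, rng):
    # Approximate FMO of (21): with probability at least 1/ν, the drawn
    # partition contains the exact maximizer over all of Y_f.
    part = rng.choice(partitions)
    return max(part, key=lambda y: scores_by_label[y])

scores = {"a": 0.1, "b": 2.0, "c": -1.0, "d": 0.5}
partitions = [["a", "b"], ["c", "d"]]       # ν = 2 disjoint subsets of Y_f
y_hat = sampled_fmo(scores, partitions, random.Random(0))
```

With a single partition (ν = 1) the oracle is exact.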
This type of greedy Augmented Lagrangian method was also analyzed previously in different contexts [23, 24, 22]. Let d(λ) = min_α L(α, λ) be the dual objective of (12), and let

Δ_d^t := d* − d(λ^t),   Δ_p^t := L(α^t, λ^t) − d(λ^t)    (22)

be the dual and primal suboptimality of problem (10), respectively. We have the following theorems.

Theorem 1 (Convergence of GDMM with AFW). The iterates {(α^t, λ^t)}_{t=1}^∞ produced by Algorithm 1, with step 1 performed by Algorithm 2, satisfy

E[Δ_p^t + Δ_d^t] ≤ ε  for  t ≥ ω log(1/ε)    (23)

for any 0 < η ≤ ρ / (4 + 16(1+ν)mQ/μ_M), with ω = max{ 2(1 + 4mQ(1+ν)/μ_M), τ/η }, where μ_M is the generalized geometric strong convexity constant of (12), Q is the Lipschitz constant of the gradient of objective (12), and τ > 0 is a constant depending on the optimal solution set.

Theorem 2 (Convergence of GDMM with BGCD). The iterates {(α^t, λ^t)}_{t=1}^∞ produced by Algorithm 1, with step 1 performed by Algorithm 3, satisfy

E[Δ_p^t + Δ_d^t] ≤ ε  for  t ≥ ω₁ log(1/ε)    (24)

for any 0 < η ≤ ρ / (4(1 + Q_max ν/μ₁)), with ω₁ = max{ 2(1 + Q_max ν/μ₁), τ/η }, where μ₁ is the generalized strong convexity constant of objective (12) and Q_max = max_{f∈F} Q_f is the factorwise Lipschitz constant of the gradient.

4 Experiments

In this section, we compare with existing approaches on Sequence Labeling and Multilabel Prediction with pairwise interactions. 
The algorithms in comparison are: (i) BCFW: a Block-Coordinate Frank-Wolfe method based on a structural oracle [9], which outperforms other competitors such as the Cutting-Plane, FW, and online-EG methods in [9]; (ii) SSG: an implementation of the Stochastic Subgradient method [16]; (iii) Soft-BCFW: the algorithm proposed in [12], which avoids the structural oracle by minimizing an approximate objective, in which a parameter ρ controls the precision of the approximation. We tuned this parameter and plot the two best settings. For BCFW and SSG, we adapted the MATLAB implementation provided by the authors of [9] into C++, which is an order of magnitude faster. All other implementations are also in C++. The results are compared in terms of primal objective (achieved by w) and test accuracy.

Our experiments are conducted on 4 public datasets: POS, ChineseOCR, RCV1-regions, and EUR-Lex (directory codes). For sequence labeling we experiment on POS and ChineseOCR. The POS dataset is a subset of the Penn Treebank² that contains 3,808 sentences, 196,223 words, and 45 POS labels. The HIT-MW³ ChineseOCR dataset is a hand-written Chinese character dataset from [17]. The dataset has 12,064 hand-written sentences and a total of 174,074 characters. The vocabulary (label) size is 3,039.

² https://catalog.ldc.upenn.edu/LDC99T42
³ https://sites.google.com/site/hitmwdb/

Figure 2: (left) Comparison of the two FMO-based algorithms (GDMM, Soft-BCFW) in number of iterations. (right) Improvement in training time given by the sublinear-time FMO.

Figure 3: Primal objective vs. time and test error vs. time. The objective plots show that SSG converges to an objective value much higher than all other methods, as also observed in [9]. The training objective for the EUR-Lex dataset is too expensive to compute, so we are unable to plot that figure.
For the Correlated Multilabel Prediction problems, we experiment on two benchmark datasets, RCV1-regions⁴ and EUR-Lex (directory codes)⁵. The RCV1-regions dataset has 228 labels, 23,149 training instances, and 47,236 features. Note that a smaller version of RCV1, with only 30 labels and 6,000 instances, is used in [11, 12]. EUR-Lex (directory codes) has 410 directory codes as labels and a sample size of 19,348. We first compare GDMM (without subFMO) with Soft-BCFW in Figure 2. Due to its approximation (controlled by ρ), Soft-BCFW can converge to a suboptimal primal objective value; while the gap decreases as ρ increases, convergence also becomes slower. GDMM, on the other hand, enjoys faster convergence. The sublinear-time implementation of the FMO further reduces training time by an order of magnitude on the ChineseOCR dataset, as shown in Figure 2 (right). More general experiments are shown in Figure 3. When the size of the output domain is small (POS dataset), GDMM-subFMO is competitive with the other solvers. As the size of the output domain grows (ChineseOCR, RCV1, EUR-Lex), the complexity of the structural maximization oracle grows linearly or even quadratically, while the complexity of GDMM-subFMO grows only sublinearly in our experiments. Therefore, GDMM-subFMO achieves orders-of-magnitude speedups over the other methods. In particular, on ChineseOCR and EUR-Lex, each iteration of SSG, GDMM, BCFW, and Soft-BCFW takes over 10³ seconds, while it takes only a few seconds with GDMM-subFMO.

Acknowledgements. 
We acknowledge the support of ARO via W911NF-12-1-0390, NSF via grants CCF-1320746, CCF-1117055, IIS-1149803, IIS-1546452, IIS-1320894, IIS-1447574, IIS-1546459, CCF-1564000, DMS-1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

4 www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html
5 mulan.sourceforge.net/datasets-mlc.html

References

[1] D. Das, D. Chen, A. F. Martins, N. Schneider, and N. A. Smith. Frame-semantic parsing. Computational Linguistics, 40(1):9–56, 2014.

[2] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.

[3] K. Gimpel and N. A. Smith. Structured ramp loss minimization for machine translation. In NAACL, pages 221–231. Association for Computational Linguistics, 2012.

[4] A. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 1952.

[5] M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

[6] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

[7] M. P. Kumar, V. Kolmogorov, and P. H. Torr. An analysis of convex relaxations for MAP estimation. Advances in Neural Information Processing Systems, 20:1041–1048, 2007.

[8] S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504, 2015.

[9] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, pages 53–61, 2013.

[10] O. Meshi, M. Mahdavi, and D. Sontag. On the tightness of LP relaxations for structured prediction. arXiv preprint arXiv:1511.01419, 2015.

[11] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning efficiently with approximate inference via dual losses. 2010.

[12] O. Meshi, N. Srebro, and T. Hazan. Efficient training of structured SVMs via soft constraints. In AISTATS, 2015.

[13] A. Osokin, J.-B. Alayrac, I. Lukasewitz, P. K. Dokania, and S. Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. arXiv preprint arXiv:1605.09346, 2016.

[14] P. Ravikumar and J. Lafferty. Quadratic programming relaxations for metric labeling and Markov random field MAP estimation. In ICML, 2006.

[15] R. Samdani and D. Roth. Efficient decomposed learning for structured prediction. In ICML, 2012.

[16] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 2011.

[17] T. Su, T. Zhang, and D. Guan. Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text. IJDAR, 10(1):27–38, 2007.

[18] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In ICML, 2005.

[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16, 2003.

[20] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.

[21] P. Woodland and D. Povey. Large scale discriminative training for speech recognition. In ASR2000 - Automatic Speech Recognition Workshop (ITRW), 2000.

[22] I. E. Yen, X. Lin, J. Zhang, P. Ravikumar, and I. S. Dhillon. A convex atomic-norm approach to multiple sequence alignment and motif discovery. 2016.

[23] I. E. Yen, D. Malioutov, and A. Kumar. Scalable exemplar clustering and facility location via augmented block coordinate descent with column generation. In AISTATS, 2016.

[24] I. E.-H. Yen, K. Zhong, C.-J. Hsieh, P. K. Ravikumar, and I. S. Dhillon. Sparse linear programming via primal and dual augmented coordinate descent. In NIPS, 2015.