{"title": "A Smoother Way to Train Structured Prediction Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4766, "page_last": 4778, "abstract": "We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. The experimental results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.", "full_text": "A Smoother Way to Train\n\nStructured Prediction Models\n\nKrishna Pillutla, Vincent Roulet, Sham M. Kakade, Zaid Harchaoui\n\nPaul G. Allen School of Computer Science & Engineering and Department of Statistics\n\nUniversity of Washington\n\nname@uw.edu\n\nAbstract\n\nWe present a framework to train a structured prediction model by performing\nsmoothing on the inference algorithm it builds upon. 
Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. The experimental results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.

1 Introduction

Consider the optimization problem arising when training structural support vector machines:

    min_{w ∈ R^d} [ F(w) := (1/n) Σ_{i=1}^n f^(i)(w) + (λ/2) ‖w‖_2^2 ] ,    (1)

where each f^(i) is the structural hinge loss. Structural support vector machines were designed for prediction problems where outputs are discrete data structures such as sequences or trees [59, 65]. Batch nonsmooth optimization algorithms such as cutting plane methods are appropriate for problems with small or moderate sample sizes [65, 21]. Stochastic nonsmooth optimization algorithms such as stochastic subgradient methods can tackle problems with large sample sizes [49, 57].
However both\nfamilies of methods achieve the typical worst-case complexity bounds of nonsmooth optimization\nalgorithms and cannot easily leverage a possible hidden smoothness of the objective.\nFurthermore, as signi\ufb01cant progress is being made on incremental smooth optimization algorithms\nfor training unstructured prediction models [36], we would like to transfer such advances and design\nfaster optimization algorithms to train structured prediction models. Indeed if each term in the\n\ufb01nite-sum were L-smooth 1, incremental optimization algorithms such as MISO [37], SAG [33, 53],\nSAGA [10], SDCA [55], and SVRG [23] could leverage the \ufb01nite-sum structure of the objective (1)\nand achieve faster convergence than batch algorithms on large-scale problems.\n\n1We say f is L-smooth with respect to (cid:107)\u00b7(cid:107) when \u2207f exists everywhere and is L-Lipschitz with respect to\n\n(cid:107)\u00b7(cid:107). Smoothness and strong convexity are taken to be with respect to (cid:107) \u00b7 (cid:107)2 unless stated otherwise.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fIncremental optimization algorithms can be further accelerated, either on a case-by-case basis [56,\n14, 1, 9] or using the Catalyst acceleration scheme [35, 36], to achieve near-optimal convergence\nrates [67]. Accelerated incremental optimization algorithms demonstrate stable and fast convergence\nbehavior on a wide range of problems, in particular for ill-conditioned ones.\nWe introduce a general framework that allows us to bring the power of accelerated incremental\noptimization algorithms to the realm of structured prediction problems. To illustrate our framework,\nwe focus on the problem of training a structural support vector machine (SSVM). 
The same ideas can\nbe applied to other structured prediction models to obtain faster training algorithms.\nWe seek primal optimization algorithms, as opposed to saddle-point or primal-dual optimization\nalgorithms, in order to be able to tackle structured prediction models with af\ufb01ne mappings such as\nSSVM as well as deep structured prediction models with nonlinear mappings. We show how to shade\noff the inherent non-smoothness of the objective while still being able to rely on ef\ufb01cient inference\nalgorithms.\nSmooth inference oracles. We introduce a notion of smooth inference oracles that gracefully \ufb01ts\nthe framework of black-box \ufb01rst-order optimization. While the exp inference oracle reveals the\nrelationship between max-margin and probabilistic structured prediction models, the top-K inference\noracle can be ef\ufb01ciently computed using simple modi\ufb01cations of ef\ufb01cient inference algorithms in\nmany cases of interest.\nIncremental optimization algorithms. We present a new algorithm built on top of SVRG, blending\nan extrapolation scheme for acceleration and an adaptive smoothing scheme. We establish the worst-\ncase complexity bounds of the proposed algorithm and demonstrate its effectiveness compared to\ncompeting algorithms on two tasks, namely named entity recognition and visual object localization.\nThe code is publicly available on the authors\u2019 websites. All the proofs are provided in [48].\n\n2 Smoothing Inference for Structured Prediction\nGiven an input x \u2208 X of arbitrary structure, e.g. a sentence, a structured prediction model outputs its\nprediction as a structured object y \u2208 Y, such as a parse tree, where the set of all outputs Y may be\n\ufb01nite yet often large. The score function \u03c6, parameterized by w \u2208 Rd, quanti\ufb01es the compatibility\nof an input x and an output y as \u03c6(x, y; w). 
It is assumed to decompose onto the structure at hand such that the inference problem y*(x; w) ∈ argmax_{y'∈Y} φ(x, y'; w) can be solved efficiently by a combinatorial optimization algorithm. Training a structured prediction model then amounts to finding the best score function such that the inference procedure provides correct predictions.
Structural hinge loss. The standard formulation uses a feature map Φ : X × Y → R^d such that score functions are linear in w, i.e. φ(x, y; w) = Φ(x, y)ᵀw. The structural hinge loss, an extension of binary and multi-class hinge losses, considers a majorizing surrogate of a given loss function ℓ, such as the Hamming loss, that measures the error incurred by predicting y*(x; w) on a sample (x, y) as ℓ(y, y*(x; w)). For an input-output pair (xi, yi), the structural hinge loss is defined as

    f^(i)(w) = max_{y'∈Y} { φ(xi, y'; w) + ℓ(yi, y') } − φ(xi, yi; w) = max_{y'∈Y} ψi(y'; w) ,    (2)

where ψi(y'; w) := φ(xi, y'; w) + ℓ(yi, y') − φ(xi, yi; w) = a_{i,y'}ᵀ w + b_{i,y'} is the augmented score function, an affine function of w. The loss ℓ is also assumed to decompose onto the structure so that the maximization in (2), also known as loss augmented inference, is no harder than the inference problem consisting in computing y*(x; w). The learning problem (1) is the minimization of the structural hinge losses on the training data (xi, yi)_{i=1}^n with a regularization penalty. We shall refer to a generic term f(w) = max_{y'∈Y} ψ(y'; w) in the finite-sum from now on.
Smoothing strategy. To smooth the structural hinge loss, we decompose it as the composition of the max function with a linear mapping.
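For intuition, the structural hinge loss (2) and its loss-augmented inference can be computed by brute force whenever Y is small enough to enumerate. A minimal sketch under toy assumptions (the feature map, Hamming loss, and label set used in the example are illustrative, not the paper's experimental setup):

```python
import numpy as np

def structural_hinge_loss(w, Phi, x, y, Y, loss):
    """Structural hinge loss f(w) = max_{y'} [phi(x,y';w) + l(y,y')] - phi(x,y;w)
    with linear scores phi(x,y;w) = Phi(x,y)^T w, by brute-force enumeration of Y.
    Returns the loss value and the loss-augmented-inference maximizer y*."""
    scores = [Phi(x, yp) @ w + loss(y, yp) for yp in Y]
    k = int(np.argmax(scores))
    return scores[k] - Phi(x, y) @ w, Y[k]

# Toy sequence problem: outputs are label pairs, loss is the Hamming loss.
Phi = lambda x, yp: np.array([float(yp[0]), float(yp[1])])  # illustrative feature map
hamming = lambda y, yp: float(sum(a != b for a, b in zip(y, yp)))
Y = [(0, 0), (0, 1), (1, 0), (1, 1)]
f_val, y_star = structural_hinge_loss(np.array([0.5, -2.0]), Phi, None, (1, 0), Y, hamming)
```

In real models the enumeration is replaced by a combinatorial algorithm (e.g. Viterbi for chains), since |Y| grows exponentially with the output size.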
The former can then be easily smoothed through its dual formulation to obtain a smooth surrogate of (2). Formally, define the mapping g and the max function h respectively as

    g : R^d → R^m, w ↦ (ψ(y'; w))_{y'∈Y} = Aw + b ,    h : R^m → R, z ↦ max_{i∈[m]} z_i ,    (3)

where m = |Y|. The structural hinge loss can now be expressed as f = h ∘ g.

Figure 1: Viterbi trellis for a chain graph with four nodes and three labels. (a) Non-smooth. (b) ℓ2^2 smoothing. (c) Entropy smoothing.

The max function can be written as h(z) = max_{i∈[m]} z_i = max_{u∈Δ^{m−1}} zᵀu where Δ^{m−1} is the probability simplex in R^m. Its simplicity allows us to analytically compute its infimal convolution with a smooth function [2]. The smoothing h_{µω} of h by a strongly convex function ω with smoothing coefficient µ > 0 is defined as

    h_{µω}(z) := max_{u∈Δ^{m−1}} { zᵀu − µω(u) } ,

whose gradient is the maximizer of the above expression. The smooth approximation of the structural hinge loss is then given by f_{µω} := h_{µω} ∘ g. This smoothing technique was introduced by Nesterov [43] who showed that if ω is 1-strongly convex with respect to ‖·‖_α, then f_{µω} is (‖A‖²_{2,α}/µ)-smooth², and approximates f for any w as

    µ min_{u∈Δ^{m−1}} ω(u) ≤ f(w) − f_{µω}(w) ≤ µ max_{u∈Δ^{m−1}} ω(u) .

Smoothing variants.
We focus on the negative entropy and the squared Euclidean norm as choices for ω, denoted respectively

    −H(u) := Σ_{i=1}^m u_i log u_i    and    ℓ2^2(u) := (1/2)(‖u‖_2^2 − 1) .

The gradient of their corresponding smooth counterparts can be computed respectively by the softmax and the orthogonal projection onto the simplex, i.e.

    ∇h_{−µH}(z) = [ exp(z_i/µ) / Σ_{j=1}^m exp(z_j/µ) ]_{i=1,...,m}    and    ∇h_{µℓ2^2}(z) = proj_{Δ^{m−1}}(z/µ) .

The gradient of the smooth surrogate f_{µω} can be written using the chain rule. This involves computing ∇g along all m = |Y| of its components, which may be intractable. However, for the ℓ2^2 smoothing, the gradient ∇h_{µℓ2^2}(z) is given by the projection of z/µ onto the simplex, which selects a small number, denoted K_{z/µ}, of its largest coordinates. We shall approximate this projection by fixing K independently of z/µ and defining

    h_{µ,K}(z) = max_{u∈Δ^{K−1}} { z_{[K]}ᵀ u − µ ℓ2^2(u) } ,

as an approximation of h_{µℓ2^2}(z), where z_{[K]} ∈ R^K denotes the K largest components of z. If K_{z/µ} < K this approximation is exact and, for fixed z, this holds for small enough µ, as shown in [48]. The resulting surrogate is denoted f_{µ,K} = h_{µ,K} ∘ g.
Smooth inference oracles. We define a smooth inference oracle as a first-order oracle for a smooth counterpart of the structural hinge loss. Recall that a first-order oracle for a function f is a numerical routine which, given a point w ∈ dom(f), returns the function value f(w) and a (sub)gradient v ∈ ∂f(w). We define three variants of a smooth inference oracle: i) the max oracle; ii) the exp oracle; iii) the top-K oracle.
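Both smoothing gradients admit short standalone implementations. A minimal numpy sketch, assuming the standard sorting-based algorithm for Euclidean projection onto the simplex (an implementation choice, not prescribed by the text):

```python
import numpy as np

def grad_entropy_smoothed_max(z, mu):
    """Gradient of the entropy-smoothed max h_{-mu H}: the softmax of z/mu."""
    e = np.exp((z - z.max()) / mu)   # shift by max(z) for numerical stability
    return e / e.sum()

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sorting-based);
    the gradient of the l2-smoothed max is this projection applied to z/mu."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)
```

Projecting z/µ for small µ yields a sparse vector supported on the top coordinates (e.g. projecting [10, 1, 0] gives [1, 0, 0]), which is what motivates the fixed-K approximation above.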
The max oracle corresponds to the usual inference oracle in maximum margin structured prediction, while the exp oracle and the top-K oracle correspond respectively to the entropy-based and ℓ2^2-based smoothing.
Figure 1 illustrates the notion on a chain structured output. The inference problem is non-smooth and a small change in w might lead to a radical change in the best scoring path as shown in Fig. 1a. The ℓ2^2-based smooth inference amounts to picking some number of the top scoring paths. Notice the sparsity pattern in Fig. 1b. The entropy-based smooth inference amounts to weighting all paths, with a higher weight for top scoring paths as shown in Fig. 1c.

2 ‖A‖_{β,α} := max{ uᵀAw | ‖u‖_α ≤ 1, ‖w‖_β ≤ 1 }.

Table 1: Smooth inference oracles, algorithms and complexity. Here, p is the size of each y ∈ Y. The time complexity is phrased in terms of the time complexity T of the max oracle.

    Max oracle (Algo)        | Top-K oracle (Algo, Time)   | Exp oracle (Algo, Time)
    Dynamic Programming      | Top-K DP, O(KT log K)       | Sum-Product, O(T)
    Graph cut                | BMMF, O(pKT)                | Intractable
    Graph matching           | BMMF, O(KT)                 | Intractable
    Branch and Bound search  | Top-K search, N/A           | Intractable

Definition 1.
Consider f(w) = max_{y'∈Y} ψ(y'; w) and w ∈ R^d,
• the max oracle returns f(w) and ∇ψ(y*; w) ∈ ∂f(w), where y* ∈ argmax_{y'∈Y} ψ(y'; w);
• the exp oracle returns f_{−µH}(w) and ∇f_{−µH}(w) = E_{y'∼p_µ}[∇ψ(y'; w)], where p_µ(y') ∝ exp(ψ(y'; w)/µ);
• the top-K oracle computes the K best outputs {y^(i)}_{i=1}^K = Y_K satisfying
      ψ(y^(1); w) ≥ ··· ≥ ψ(y^(K); w) ≥ max_{y'∈Y\Y_K} ψ(y'; w)
  to return f_{µ,K}(w) and ∇f_{µ,K}(w) as surrogates for f_{µℓ2^2}(w) and ∇f_{µℓ2^2}(w).
On the one hand, the entropy-based smoothing of a structural support vector machine somewhat interpolates between a regular structural support vector machine and a conditional random field [31] through the smoothing parameter µ. On the other hand, the ℓ2^2-based smoothing only requires a top-K oracle, making it a more practical option, as illustrated in Table 1.
Smooth inference algorithms. The implementation of inference oracles depends on the structure of the output, given by a probabilistic graphical model [48]. When the latter is a tree, exact procedures are available, otherwise some algorithms may not be practical. See Table 1 for a summary³. The formal description, algorithms and proofs of correctness are provided in [48].
Dynamic Programming. For graphs with a tree structure or bounded tree-width, the max oracle is implemented by dynamic programming (DP) algorithms such as the popular Viterbi algorithm. The exp oracle can be implemented by replacing the max in DP with log-sum-exp and using back-propagation at O(1) times the cost of the max oracle.
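The max-to-log-sum-exp substitution can be sketched for a chain model. In this toy numpy illustration (the scores are made up, not taken from the paper), µ = 0 recovers the Viterbi value while µ > 0 computes the entropy-smoothed value:

```python
import numpy as np

def chain_dp(unary, pairwise, mu=0.0):
    """Forward DP over a chain with unary scores (T x S) and pairwise scores
    (S x S). mu = 0 gives the max oracle value (Viterbi); mu > 0 replaces max
    by the soft maximum mu * log-sum-exp(./mu), giving the exp-oracle value."""
    def smax(v):
        if mu == 0.0:
            return v.max()
        m = v.max()
        return m + mu * np.log(np.exp((v - m) / mu).sum())  # stable log-sum-exp
    alpha = unary[0].copy()
    for t in range(1, len(unary)):
        alpha = np.array([smax(alpha + pairwise[:, s]) + unary[t, s]
                          for s in range(unary.shape[1])])
    return smax(alpha)
```

Since log-sum-exp upper-bounds the max, the smoothed value lies between the Viterbi value and that value plus µ log |Y| (here |Y| = S^T), consistent with the approximation bounds of Sec. 2.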
The top-K oracle is implemented by the top-K DP algorithm which keeps track of the K largest intermediate scores and the back-pointers at O(K) times the cost of the max oracle; see [48] for details.
Graph cut and matching. For specific probabilistic graphical models, exact inference is possible in loopy graphs by the use of graph cuts [27] or perfect matchings in bipartite graphs [60]. In this case, a top-K oracle can be implemented by the best max marginal first (BMMF) algorithm [68] at 2K computations of max-marginals, which can be efficiently computed for graph cuts [26] and matchings [12]. The exp oracle is intractable (in fact, it is #P-complete) [20].
Branch and bound search. In special cases, branch and bound search allows exact inference in loopy graphs by partitioning Y and exploring promising parts first using a heuristic. Examples include the celebrated efficient subwindow search [32] in computer vision or the A* algorithm in natural language processing [34, 18]. Here, the top-K oracle can be implemented by letting the search run until K outputs are found while the exp oracle is intractable. The running time of both the max and top-K oracles depends on the heuristic used and might be exponential in the worst case.

3 The notation O(·) may hide constants and factors logarithmic in problem parameters. See [48] for detailed
See [48] for detailed\n\ncomplexities.\n\n4\n\n\f(cid:34)\n\n(cid:35)\n\n,\n\n(5)\n\nn(cid:88)\n\ni=1\n\n1\nn\n\nAlgorithm 1 Catalyst with smoothing\n1: Input: Objective F in (1), linearly convergent method M, initial w0, \u03b10 \u2208 (0, 1).\n2: Initialize: z0 = w0.\n3: for k = 1 to T do\n4:\n\nUsing M with zk\u22121 as the starting point, \ufb01nd\n\nSmoothing (\u00b5k)k\u22651 and regularization (\u03bak)k\u22651 parameters, relative accuracies (\u03b4k)k\u22651.\n\nwk \u2248 argmin\nw\u2208Rd\n\nF\u00b5k\u03c9,\u03bak (w; zk\u22121) :=\n\nf (i)\n\u00b5k\u03c9(w) +\n\n(cid:107)w(cid:107)2\n\n2 +\n\n\u03bb\n2\n\n\u03bak\n2\n\n(cid:107)w \u2212 zk\u22121(cid:107)2\n\n2\n\nsuch that F\u00b5k\u03c9,\u03bak (wk; zk\u22121) \u2212 minw F\u00b5k\u03c9,\u03bak (w; zk\u22121) \u2264 \u03b4k\u03bak\nCompute \u03b1k and \u03b2k such that\n\n2 (cid:107)wk \u2212 zk\u22121(cid:107)2\n2.\n\n5:\n\nk(\u03bak+1 + \u03bb) = (1 \u2212 \u03b1k)\u03b12\n\u03b12\n\nk\u22121(\u03bak + \u03bb) + \u03b1k\u03bb ,\n\n\u03b2k = \u03b1k\u22121(1\u2212\u03b1k\u22121)(\u03bak+\u03bb)\n\nk\u22121(\u03bak+\u03bb)+\u03b1k(\u03bak+1+\u03bb) .\n\u03b12\n\nSet zk = wk + \u03b2k(wk \u2212 wk\u22121).\n\n6:\n7: end for\n8: return wT .\n\n3 Catalyst with smoothing\n\n2(cid:107)w(cid:107)2\n\nFor a single input-output pair (n = 1), the problem (1) is minw\u2208Rd h(Aw + b) + \u03bb\n2, where\nh is a simple non-smooth convex function. The Nesterov smoothing technique overcomes the non-\nsmoothness of the objective by considering a smooth surrogate instead [43, 42]. We combine this\nwith the Catalyst scheme to accelerate a linearly-convergent smooth optimization algorithm [36].\nCatalyst with smoothing. The Catalyst approach considers at each outer iteration a regularized\nobjective centered around the current iterate [36]. 
The algorithm proceeds by performing approximate proximal point steps, that is, from a point z and for a step-size 1/κ one computes the minimizer of min_{w∈R^d} F(w) + (κ/2)‖w − z‖_2^2. We only need an approximate solution returned by a given optimization method M that enjoys a linear convergence guarantee.
We extend the Catalyst approach to non-smooth optimization problems by performing adaptive smoothing in the outer-loop and adjusting the level of accuracy accordingly in the inner-loop. We define

    F_{µω,κ}(w; z) := (1/n) Σ_{i=1}^n f^(i)_{µω}(w) + (λ/2)‖w‖_2^2 + (κ/2)‖w − z‖_2^2    (4)

as a smooth surrogate to the objective centered around a given point z ∈ R^d. Note that the original Catalyst considered a fixed regularization term κ [36], while we vary κ and µ. Doing so enables us to get adaptive smoothing strategies.
The proposed inner-outer scheme is presented in Algorithm 1. In view of the strong convexity of F_{µkω,κk}(· ; z_{k−1}), the stopping criterion for the subproblem (5) can be checked by looking at the gradient of F_{µkω,κk}(· ; z_{k−1}). As it is smooth and strongly convex, the maximal number of iterations to satisfy the stopping criterion can also be derived. In practice, however, we recommend a practical variant similar in spirit to the one proposed by [36] that lets M run for a fixed budget of iterations in each inner loop. Below, we denote w* ∈ argmin_{w∈R^d} F(w) and F* = F(w*).
Theorem 1. Consider problem (1) and a smoothing function ω s.t. −D ≤ ω(u) ≤ 0 for all u ∈ Δ. Assume parameters (µk)_{k≥1}, (κk)_{k≥1}, (δk)_{k≥1} of Algorithm 1 are non-negative with (µk)_{k≥1} non-increasing, δk ∈ [0, 1), and αk ∈ (0, 1) for all k.
Then, Algorithm 1 generates (w_k)_{k≥0} such that

    F(w_k) − F* ≤ (A_0^{k−1}/B_1^k) Δ_0 + µ_k D + Σ_{j=1}^k (A_j^{k−1}/B_j^k) (µ_{j−1} − (1 − δ_j) µ_j) D ,    (6)

where A_j^k = Π_{i=j}^k (1 − α_i), B_j^k = Π_{i=j}^k (1 − δ_i), Δ_0 = F(w_0) − F* + ((κ_1 + λ)α_0^2 − λα_0)/(2(1 − α_0)) · ‖w_0 − w*‖_2^2, and µ_0 = 2µ_1.

Table 2: Summary of global complexity of SC-SVRG, i.e., Algorithm 1 with SVRG as the inner solver for various parameter settings. We show E[N], the expected total number of SVRG iterations required to obtain an accuracy ε, up to constants and factors logarithmic in problem parameters. We denote ΔF_0 := F(w_0) − F* and Δ_0 = ‖w_0 − w*‖_2. Constants a and D are defined so that −D ≤ ω(u) ≤ 0 for all u ∈ Δ and each f^(i) is a/µ-smooth for i ∈ [n].

    Scheme | λ > 0 | µ_k          | κ_k          | δ_k          | E[N]                              | Remark
    1      | Yes   | ε/D          | aD/(εn) − λ  | √(λεn/(aD))  | n + √(aDn/(λε))                   | fix ε in advance
    2      | Yes   | µc^k         | λ            | c'           | n + a/(λε)                        | c, c' < 1 are universal constants
    3      | No    | ε/D          | aD/(εn)      | 1/k²         | n√((ΔF_0 + µD)/ε) + √(aDnΔ_0)/ε   | fix ε in advance
    4      | No    | µ/k          | κ_0·k        | 1/k²         | (Δ̂_0/ε)(n + a/(µκ_0))             | Δ̂_0 = ΔF_0 + (κ_0/2)Δ_0² + µD

Theorem 1 establishes the complexity of the Catalyst smoothing scheme for a general smoothing function and a general linearly-convergent smooth optimization algorithm M.
Using Theorem 1, we can
derive strategies for strongly or non-strongly convex objectives (λ > 0 or not) with adaptive smoothing that vanishes over time to get progressively better surrogates of the original objective.
The global complexity of the algorithm depends then on the choice of M. We present in Table 2 the total complexity for different strategies when SVRG [23] is used as M, resulting in an algorithm called SC-SVRG in the remainder of the paper. Note that the adaptive smoothing schemes (2, 4) do not match the rate obtained by a fixed smoothing (1, 3). A standard doubling trick can easily fix this. Yet we choose to use an adaptive smoothing scheme, easier to use and working well in practice (see Sec. 4). All proofs are given in [48].
Extension to nonlinear mappings. When the score function is not linear in w, the overall problem is not convex in general. However, if the score function is smooth, then one can take advantage of the composite structure of the structural hinge loss f = h ∘ g by using the prox-linear algorithm [6, 11]. At each step, the latter linearizes the mapping g around the current iterate w_k, resulting in a convex model w ↦ h(g(w_k) + ∇g(w_k)ᵀ(w − w_k)) of h ∘ g around w_k. The overall convex model of the objective F with an additional proximal term is then minimized. The next iterate is given by

    w_{k+1} = argmin_{w∈R^d} (1/n) Σ_{i=1}^n h(g^(i)(w_k) + ∇g^(i)(w_k)ᵀ(w − w_k)) + (λ/2)‖w‖_2^2 + (1/(2γ))‖w − w_k‖_2^2 ,    (7)

where g^(i) is the mapping associated with the ith sample and γ > 0 is the parameter of the proximal term. This subproblem reduces to training a structured prediction model with an affine augmented score function. Therefore we can solve it with the SC-SVRG algorithm introduced earlier. Note that only approximate solutions are required to get a global convergence to a stationary point [11].
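A single prox-linear step can be sketched in the simplest setting (one sample, h = max), with plain subgradient descent standing in for the inner solver; the paper instead solves the convex subproblem with SC-SVRG, so the inner solver, step size, and iteration budget below are illustrative assumptions:

```python
import numpy as np

def prox_linear_step(w, g, Jg, lam, gamma, n_inner=300, lr=0.05):
    """One prox-linear step for F(w) = max(g(w)) + (lam/2)||w||^2:
    linearize the mapping g around w and approximately minimize the convex model
        v -> max(g(w) + Jg(w) @ (v - w)) + (lam/2)||v||^2 + ||v - w||^2/(2*gamma)
    by subgradient descent (illustrative inner solver)."""
    g0, J = g(w), Jg(w)
    v = w.copy()
    for _ in range(n_inner):
        z = g0 + J @ (v - w)
        i = int(np.argmax(z))             # active component gives a subgradient of the max
        grad = J[i] + lam * v + (v - w) / gamma
        v = v - lr * grad
    return v
```

On an affine mapping the linearization is exact, so a sufficiently accurate inner solve decreases the true objective F; with a nonlinear smooth g the same code applies with g and its Jacobian Jg evaluated at the current iterate.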
The theoretical analysis and numerical experiments showing the potential of this approach compared to subgradient methods can be found in [48].

4 Experiments

We compare the proposed algorithm and several competing algorithms to train a structural support vector machine on the tasks of named entity recognition and visual object localization. Additional details on the datasets, algorithms, parameters as well as an extensive evaluation in different settings can be found in [48]. We use the ℓ2^2-based smoothing in all experiments as explained in Sec. 2.
Named entity recognition. The task consists in predicting the tagging of a sequence into named entities. We consider the CoNLL 2003 dataset with n = 14987 [63]. The Viterbi algorithm provides an efficient max oracle and the top-K oracle is obtained following the discussion in Sec. 2. The loss ℓ is the Hamming loss here. The features Φ(x, y) are obtained from the local context around each word [64]. We use the F1 score as the performance metric for evaluation.

Figure 2: Experimental comparison of proposed methods for the tasks of named entity recognition (Fig. 2a) and visual object localization (Fig. 2b). (a) Performance on CoNLL-2003 for named entity recognition. (b) Sample performance on PASCAL VOC 2007 for visual object localization for λ = 10/n. All shaded areas represent one standard deviation over ten random runs. See [48] for all plots.

Visual object localization. The task consists in predicting the spatial location of the visual object in an image. We consider the PASCAL VOC 2007 [13] dataset and focus on the "cat" and "dog" categories. Additional experimental results for other categories can be found in [48]. We train an independent classifier for each class. We follow the methodology outlined in [15] to construct Φ(x, y). We crop image x to the bounding box y, resize the resulting patch and pass it through a convolutional network pre-trained on a different dataset. We use here AlexNet [28] pre-trained on ImageNet [50] and take as Φ(x, y) the output of the layer conv4. We use selective search to restrict |Y| to 1000 [66]. The max and top-K oracles are implemented as exhaustive searches over this reduced set. We use 1 − IoU as the task loss where IoU(y, y') = Area(y ∩ y')/Area(y ∪ y'). Moreover, the ground truth label y is replaced by argmax_{y'∈Y} IoU(y, y'). We use the average precision (AP) as the performance metric for evaluation [13].
Methods. The plots compare two non-smooth optimization methods: (a) SGD, a primal stochastic subgradient method with step-sizes chosen as γt = γ0/(1 + t/t0), where γ0, t0 are parameters to be tuned, which returns the averaged iterate w̄t = 2/(t(t+1)) Σ_{j=1}^t j w_j [29], and (b) BCFW, the Block-Coordinate Frank-Wolfe algorithm [30], with the tuning of the parameters proposed by the authors, and the averaged iterate as above (bcfw-wavg). The methods that use smoothing are SVRG [23] with constant smoothing and two variants of SC-SVRG, namely SC-SVRG-const, which uses constant smoothing (Scheme 1 in Table 2) and SC-SVRG-adapt, which uses adaptive smoothing (Scheme 2 in Table 2). Note that the step-size scheme of SGD does not follow from a classical theoretical analysis, yet performs better in practice than the one used by Pegasos [57].
Parameters. BCFW requires no tuning, while SGD requires the tuning of γ0 and t0. The SVRG-based methods require the tuning of a fixed learning rate. Moreover, SVRG and SC-SVRG-const also require tuning the amount of smoothing µ. The validation F1 score and the train loss are used as the tuning criteria for named entity recognition and visual object localization respectively. A fixed budget T_inner = n is used as the stopping criterion in Algorithm 1.
This corresponds to the one-pass heuristic of [36], who found the theoretical stopping criterion to be overly pessimistic. We use the value κk = λ for SC-SVRG-adapt. All smooth optimization methods turned out to be robust to the choice of K for the top-K oracle (Fig. 3); we use K = 5 for named entity recognition and K = 10 for visual object localization.
Experiments. We present in Fig. 2 the convergence behavior of the different methods on the named entity recognition and visual object localization tasks. We plot the error on the training set vs. the number of oracle calls and the performance metric on a held-out set vs. the number of oracle calls.

Figure 3: Effect of hyperparameters µ and K on SC-SVRG-const and SC-SVRG-adapt. (a) Effect of the smoothing hyperparameter µ on SC-SVRG-const and SC-SVRG-adapt for CoNLL-2003. (b) Effect of the hyperparameter K on SC-SVRG-const and SC-SVRG-adapt for CoNLL-2003.

As we can see in Fig. 2, the proposed methods converge faster in terms of training error while achieving a competitive performance in terms of the performance metric on a held-out set. Furthermore, BCFW and SGD make twice as many actual passes as SVRG based algorithms. In Fig. 3, we explore the effect of the parameters µ and K on the convergence of the different methods.
We can\nsee that SC-SVRG-adapt is rather robust to the choice of \u00b5, while SC-SVRG-const and SVRG are\nmore sensitive to the choice of \u00b5. Therefore SC-SVRG-adapt seems to appear as the most practical\nvariant of our approach. We can also notice that SC-SVRG-adapt is rather robust to the choice of K.\nSetting K = 5 is suf\ufb01cient here to obtain competitive results.\n\n5 Related Work\n\nThe general framework for global training of structured prediction models was introduced in [4] and\napplied to handwriting recognition in [3] and to document processing in [5].\nSmooth inference oracles. Smooth inference oracles with (cid:96)2\n2-smoothing echo older heuristics in\nspeech and language processing [25]. In the probabilistic graphical models literature, ef\ufb01cient\nalgorithms to solve the top-K inference combinatorial optimization problems were studied under\nthe name \u201cM-best MAP\u201d in [54, 45, 68]. See [48] for a longer survey. Previous works considering\nsmooth inference oracles yet encompassed by our framework can be found in [22, 24, 51, 41].\nInstances of smooth inference oracles framed in the context of \ufb01rst-order optimization were studied\nin [58, 69] and in [38]. We framed here a general notion of smooth inference oracles in the context of\n\ufb01rst-order optimization. The framework not only includes previously proposed inference oracles but\nalso introduces new ones.\nRelated ideas to ours appear in the independent works [39, 44]. These works partially overlap with\nours, but the papers choose different perspectives, making them complementary to each other. In [39],\nthe authors proceed differently when, e.g., smoothing inference based on dynamic programming.\nMoreover, they do not establish complexity bounds for optimization algorithms making calls to the\nresulting smooth inference oracles. 
We define smooth inference oracles in the context of black-box first-order optimization and establish worst-case complexity bounds for incremental optimization algorithms making calls to these oracles. Indeed, we relate the amount of smoothing, controlled by µ, to the resulting complexity of the optimization algorithms relying on smooth inference oracles.

Batch and incremental optimization algorithms. Several families of algorithms for structural support vector machines were proposed; Table 3 gives an overview with their oracle complexities. Early works [59, 65, 21, 62] considered batch dual quadratic optimization (QP) algorithms.

Table 3: Convergence rates given in the number of calls to various oracles for different optimization algorithms on the learning problem (1) in the case of SSVMs (2). The rates are specified in terms of the target accuracy ε, the number of training examples n, the regularization λ, the size of the label space |Y|, and the feature norm R (see [48]). The rates are specified up to constants and factors logarithmic in the problem parameters; the dependence on the initial error is ignored. * denotes algorithms that make O(1) oracle calls per iteration.

Algo. (max oracle)                    # Oracle calls
BMRM [62]                             nR²/(λε)
QP 1-slack [21]                       nR²/(λε)
Stochastic subgradient* [57]          R²/(λε)
Block-Coordinate Frank-Wolfe* [30]    n + R²/(λε)

Algo. (exp oracle)                                  # Oracle calls
Exponentiated gradient* [7]                         (n + log|Y|) R²/(λε)
Excessive gap reduction [69]                        nR √(log|Y|/(λε))
This work*, fixed smoothing, entropy smoother       √(nR² log|Y|/(λε))
This work*, adaptive smoothing, entropy smoother    n + √(nR² log|Y|/(λε))

Algo. (top-K oracle)                                # Oracle calls
This work*, fixed smoothing, ℓ₂² smoother           √(nR̃²/(λε))
This work*, adaptive smoothing, ℓ₂² smoother        n + √(nR̃²/(λε))

The stochastic subgradient method considered by [49, 57] operated directly on the non-smooth primal formulation. More recently, [30] proposed a block-coordinate Frank-Wolfe (BCFW) algorithm to optimize the dual formulation of structural support vector machines; see also [46] for variants and extensions. Saddle-point or primal-dual optimization algorithms are another family of algorithms, including the dual extragradient algorithm of [61] and the mirror-prox algorithms of [8, 19]. In [47], an incremental optimization algorithm for saddle-point problems is proposed; however, it is unclear how to extend it to the structured prediction problems we consider here. Incremental optimization algorithms for conditional random fields were proposed in [52].
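For comparison, the entropy smoother appearing in Table 3 replaces the max by a temperature-scaled log-sum-exp. The sketch below is a minimal stand-alone illustration (ours, not code from the paper): it recovers max(scores) as µ → 0 and never exceeds max(scores) + µ log n for n scores.

```python
import math

def smoothed_max_entropy(scores, mu):
    """Entropy-smoothed max: mu * log(sum_y exp(s_y / mu)),
    computed with the usual max-shift for numerical stability."""
    m = max(scores)
    return m + mu * math.log(sum(math.exp((s - m) / mu) for s in scores))
```

The gradient of this smoothed max with respect to the scores is the softmax distribution at temperature µ, which is why an expectation ("exp") oracle is the natural interface for first-order methods on the entropy-smoothed objective.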
We focus here on primal optimization algorithms, in order to train structured prediction models with affine or nonlinear mappings in a unified approach, and on incremental optimization algorithms, in order to scale to large datasets.

6 Conclusion

We introduced a general notion of smooth inference oracles in the context of black-box first-order optimization. This sets the scene to extend the scope of fast incremental optimization algorithms to structured prediction problems through a careful blend of a smoothing strategy and an acceleration scheme. We illustrated the potential of our framework by proposing a new incremental optimization algorithm to train structural support vector machines that both enjoys worst-case complexity bounds and demonstrates competitive performance on two real-world problems. This work paves the way to faster incremental primal optimization algorithms for deep structured prediction models, explored in more detail in [48]. There are several potential avenues for future work. When there is no discrete structure that admits efficient inference algorithms, it could be beneficial not to treat inference as a black-box numerical procedure [40, 16, 17]. Instance-level improved algorithms along the lines of [17] could also be interesting to explore.

Acknowledgements. This work was supported by NSF Award CCF-1740551, the Washington Research Foundation for innovation in Data-intensive Discovery, and the program "Learning in Machines and Brains" of CIFAR.

References

[1] Z. Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. Journal of Machine Learning Research, 18:221:1–221:51, 2017.

[2] A. Beck and M. Teboulle. Smoothing and first order methods: A unified framework. SIAM Journal on Optimization, 22(2):557–580, 2012.

[3] Y. Bengio, Y. LeCun, C. Nohl, and C. Burges.
LeRec: A NN/HMM Hybrid for On-Line Handwriting Recognition. Neural Computation, 7(6):1289–1303, 1995.

[4] L. Bottou and P. Gallinari. A Framework for the Cooperation of Learning Algorithms. In Advances in Neural Information Processing Systems, pages 781–788, 1990.

[5] L. Bottou, Y. Bengio, and Y. LeCun. Global Training of Document Processing Systems Using Graph Transformer Networks. In Conference on Computer Vision and Pattern Recognition, pages 489–494, 1997.

[6] J. V. Burke. Descent methods for composite nondifferentiable optimization problems. Mathematical Programming, 33(3):260–279, Dec 1985.

[7] M. Collins, A. Globerson, T. Koo, X. Carreras, and P. L. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. Journal of Machine Learning Research, 9(Aug):1775–1822, 2008.

[8] B. Cox, A. Juditsky, and A. Nemirovski. Dual subgradient algorithms for large-scale nonsmooth learning problems. Mathematical Programming, 148(1-2):143–180, 2014.

[9] A. Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pages 676–684, 2016.

[10] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[11] D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, Jul 2018.

[12] J. C. Duchi, D. Tarlow, G. Elidan, and D. Koller. Using Combinatorial Optimization within Max-Product Belief Propagation. In Advances in Neural Information Processing Systems, pages 369–376, 2006.

[13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) challenge.
International Journal of Computer Vision, 88(2):303–338, 2010.

[14] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pages 2540–2548, 2015.

[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.

[16] T. Hazan and R. Urtasun. A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction. In Advances in Neural Information Processing Systems, pages 838–846, 2010.

[17] T. Hazan, A. G. Schwing, and R. Urtasun. Blending Learning and Inference in Conditional Random Fields. Journal of Machine Learning Research, 17:237:1–237:25, 2016.

[18] L. He, K. Lee, M. Lewis, and L. Zettlemoyer. Deep Semantic Role Labeling: What Works and What's Next. In Annual Meeting of the Association for Computational Linguistics, pages 473–483, 2017.

[19] N. He and Z. Harchaoui. Semi-Proximal Mirror-Prox for Nonsmooth Composite Minimization. In Advances in Neural Information Processing Systems, pages 3411–3419, 2015.

[20] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087–1116, 1993.

[21] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.

[22] J. K. Johnson. Convex relaxation methods for graphical models: Lagrangian and maximum entropy approaches. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2008.

[23] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction.
In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[24] V. Jojic, S. Gould, and D. Koller. Accelerated dual decomposition for MAP inference. In International Conference on Machine Learning, pages 503–510, 2010.

[25] D. Jurafsky, J. H. Martin, P. Norvig, and S. Russell. Speech and Language Processing. Pearson Education, 2014. ISBN 9780133252934.

[26] P. Kohli and P. H. Torr. Measuring uncertainty in graph cut solutions. Computer Vision and Image Understanding, 112(1):30–38, 2008.

[27] V. Kolmogorov and R. Zabin. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.

[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[29] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002, 2012.

[30] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. In International Conference on Machine Learning, pages 53–61, 2013.

[31] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In International Conference on Machine Learning, pages 282–289, 2001.

[32] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

[33] N. Le Roux, M. W. Schmidt, and F. R. Bach. A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets.
In Advances in Neural Information Processing Systems, pages 2672–2680, 2012.

[34] M. Lewis and M. Steedman. A* CCG parsing with a supertag-factored model. In Conference on Empirical Methods in Natural Language Processing, pages 990–1000, 2014.

[35] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[36] H. Lin, J. Mairal, and Z. Harchaoui. Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice. Journal of Machine Learning Research, 18(212):1–54, 2018.

[37] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

[38] A. F. T. Martins and R. F. Astudillo. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. In International Conference on Machine Learning, pages 1614–1623, 2016.

[39] A. Mensch and M. Blondel. Differentiable dynamic programming for structured prediction and attention. In International Conference on Machine Learning, pages 3459–3468, 2018.

[40] O. Meshi, D. Sontag, T. S. Jaakkola, and A. Globerson. Learning Efficiently with Approximate Inference via Dual Losses. In International Conference on Machine Learning, pages 783–790, 2010.

[41] O. Meshi, T. S. Jaakkola, and A. Globerson. Convergence Rate Analysis of MAP Coordinate Minimization Algorithms. In Advances in Neural Information Processing Systems, pages 3023–3031, 2012.

[42] Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005.

[43] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[44] V. Niculae, A. F. Martins, M. Blondel, and C. Cardie.
SparseMAP: Differentiable Sparse Structured Inference. In International Conference on Machine Learning, pages 3796–3805, 2018.

[45] D. Nilsson. An efficient algorithm for finding the M most probable configurations in probabilistic expert systems. Statistics and Computing, 8(2):159–173, 1998.

[46] A. Osokin, J.-B. Alayrac, I. Lukasewitz, P. Dokania, and S. Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In International Conference on Machine Learning, pages 593–602, 2016.

[47] B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1408–1416, 2016.

[48] K. Pillutla, V. Roulet, S. M. Kakade, and Z. Harchaoui. A Smoother Way to Train Structured Prediction Models. arXiv preprint, 2019.

[49] N. D. Ratliff, J. A. Bagnell, and M. Zinkevich. (Approximate) Subgradient Methods for Structured Prediction. In International Conference on Artificial Intelligence and Statistics, pages 380–387, 2007.

[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[51] B. Savchynskyy, J. H. Kappes, S. Schmidt, and C. Schnörr. A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In Conference on Computer Vision and Pattern Recognition, pages 1817–1823, 2011.

[52] M. Schmidt, R. Babanezhad, M. Ahmed, A. Defazio, A. Clifton, and A. Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In International Conference on Artificial Intelligence and Statistics, pages 819–828, 2015.

[53] M. Schmidt, N. Le Roux, and F. Bach.
Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[54] B. Seroussi and J. Golmard. An algorithm directly finding the K most probable configurations in Bayesian networks. International Journal of Approximate Reasoning, 11(3):205–233, 1994.

[55] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

[56] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.

[57] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[58] H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In International Conference on Machine Learning, pages 1611–1619, 2014.

[59] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, pages 25–32, 2004.

[60] B. Taskar, S. Lacoste-Julien, and D. Klein. A discriminative matching approach to word alignment. In Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 73–80, 2005.

[61] B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and Bregman projections. Journal of Machine Learning Research, 7(Jul):1627–1653, 2006.

[62] C. H. Teo, S. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 1(55), 2009.

[63] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition.
In Conference on Natural Language Learning, pages 142–147, 2003.

[64] M. Tkachenko and A. Simanovsky. Named entity recognition: Exploring features. In Empirical Methods in Natural Language Processing, pages 118–127, 2012.

[65] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning, page 104, 2004.

[66] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In International Conference on Computer Vision, pages 1879–1886, 2011.

[67] B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.

[68] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In Advances in Neural Information Processing Systems, pages 289–296, 2004.

[69] X. Zhang, A. Saha, and S. Vishwanathan. Accelerated training of max-margin Markov networks with kernels. Theoretical Computer Science, 519:88–102, 2014.