{"title": "Convex Two-Layer Modeling with Latent Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 1280, "page_last": 1288, "abstract": "Unsupervised learning of structured predictors has been a long standing pursuit in machine learning. Recently a conditional random field auto-encoder has been proposed in a two-layer setting, allowing latent structured representation to be automatically inferred. Aside from being nonconvex, it also requires the demanding inference of normalization. In this paper, we develop a convex relaxation of two-layer conditional model which captures latent structure and estimates model parameters, jointly and optimally. We further expand its applicability by resorting to a weaker form of inference---maximum a-posteriori. The flexibility of the model is demonstrated on two structures based on total unimodularity---graph matching and linear chain. Experimental results confirm the promise of the method.", "full_text": "Convex Two-Layer Modeling with Latent Structure\n\nVignesh Ganapathiraman\u2020,\n\nJunfeng Wen(cid:93)\n\nYaoliang Yu\u2217,\n\u2020University of Illinois at Chicago, Chicago, IL, USA\n\nXinhua Zhang\u2020,\n\n\u2217University of Waterloo, Waterloo, ON, Canada,\n(cid:93)University of Alberta, Edmonton, AB, Canada\n{vganap2, zhangx}@uic.edu, yaoliang.yu@uwaterloo.ca, junfengwen@gmail.com\n\nAbstract\n\nUnsupervised learning of structured predictors has been a long standing pursuit\nin machine learning. Recently a conditional random \ufb01eld auto-encoder has been\nproposed in a two-layer setting, allowing latent structured representation to be\nautomatically inferred. Aside from being nonconvex, it also requires the demanding\ninference of normalization.\nIn this paper, we develop a convex relaxation of\ntwo-layer conditional model which captures latent structure and estimates model\nparameters, jointly and optimally. 
We further expand its applicability by resorting to a weaker form of inference—maximum a-posteriori. The flexibility of the model is demonstrated on two structures based on total unimodularity—graph matching and linear chain. Experimental results confirm the promise of the method.

1 Introduction

Over the past decade deep learning has achieved significant advances in many application areas [1]. By automating the acquisition of latent descriptive and predictive representations, deep models effectively capture the relationships between observed variables. Recently more refined deep models have been proposed for structured output prediction, where several random variables for prediction are statistically correlated [2–4]. Improved performance has been achieved in applications such as image recognition and segmentation [5] and natural language parsing [6], amongst others.

So far, most deep models for structured output are designed for supervised learning where structured labels are available. Recently an extension has been made to unsupervised learning: [7] proposed a conditional random field auto-encoder (CRF-AE)—a two-layer conditional model—where given the observed data x, the latent structure y is first generated based on p(y|x), and then applied to reconstruct the observations using p(x|y). The motivation is to find the predictive and discriminative (rather than common but irrelevant) latent structure in the data. Along similar lines, several other discriminative unsupervised learning methods are also available [8–11].

Extending auto-encoders X → Y → X to general two-layer models X → Y → Z is not hard.
[12, 13] addressed transliteration between two languages, where Z is the observed binary label indicating whether two words match; higher accuracy can be achieved if we faithfully recover a letter-wise matching represented by the unobserved structure Y. In essence, their model optimizes p(z | arg max_y p(y|x)), uncovering the latent y via its mode under the first-layer model. This is known as bi-level optimization because the arg max of an inner optimization is used. A soft variant adopts the mean of y [14]. In general, conditional models yield more accurate predictions than generative models X−Y−Z (e.g. multi-wing harmoniums/RBMs), unless the latter are trained in a discriminative fashion [15].

In computation, all methods require certain forms of tractability in inference. CRF-AE leverages marginal inference on p(y|x)p(x|y) (over y) for EM. Contrastive divergence, instead, samples from p(y|x) [11]. For some structures like graph matching, neither of them is tractable [16, 17] (unless assuming first-order Markovian dependency). In single-layer models, this challenge has been resolved by max-margin estimation, which relies only on the MAP of p(y|x) [18]. This oracle is much less demanding than sampling or normalization, as finding the most likely state can be much easier than summing over all possible y. For example, MAP for graph matching can be solved by max-flow.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Unfortunately a direct extension of max-margin estimation to two-layer modeling meets with immediate obstacles, because here one has to solve max_y p(y|x)p(z|y). In general, p(z|y) depends on y in a highly nonlinear form, making this augmented MAP inference intractable. This seems to leave the aforementioned bi-level optimization as the only option that retains the sole dependency on MAP. However, solving this optimization poses a substantial challenge when y is discrete, because the mode of p(y|x) is almost always invariant to small perturbations of the model parameters (i.e. zero gradient).

In this paper we demonstrate that this optimization can be relaxed into a convex formulation while still preserving sufficient regularity to recover a non-trivial, nonlinear predictive model that supports structured latent representations. Recently a growing body of research has investigated globally trainable deep models, but they remain limited. [19] formulated convex conditional models using layer-wise kernels, connected through nonlinear losses. However these losses are data dependent, necessitating a transductive setting to retain the context. [20] used boosting, but the underlying oracle is generally intractable. Specific global methods were also proposed for polynomial networks [21] and sum-product networks [22]. None of these methods accommodate structures in latent layers.

Our convex formulation is achieved by enforcing the first-order optimality conditions of the inner-level optimization via sublinear constraints. Using a semi-definite relaxation, we arrive at the first two-layer model that allows latent structures to be inferred concurrently with model optimization while still admitting globally optimal solutions (§3). To the best of our knowledge, this is the first algorithm in machine learning that directly constructs a convex relaxation for a bi-level optimization based on the inner optimality conditions.
Unlike [19], it results in a truly inductive model, and its flexibility is demonstrated with two example structures in the framework of total unimodularity (§4). The only inference required is MAP on p(y|x), and the overall scalability is further improved by a refined optimization algorithm (§5). Experimental results demonstrate its potential in practice.

2 Preliminaries and Background

We consider a two-layer latent conditional model X → Y → Z, where X is the input, Z is the output, and Y is a latent layer composed of h random variables {Yi}_{i=1..h}. Instead of assuming no interdependency between the Yi as in [19], our major goal here is to model the structure in the latent layer Y. Specifically, we assume a conditional model for the first layer based on an exponential family

p(y|x) = q0(y) exp(−y'Ux − Ω(Ux)),  where  q0(y) = ⟦y ∈ Y⟧.  (1)

Here U is the first-layer weight matrix, and Ω is the log-partition function. q0(y) is the base measure, with ⟦x⟧ = 1 if x is true, and 0 otherwise. The correlation among the Yi is instilled by the support set Y, which plays a central role here. For example, when Y consists of all h-dimensional canonical vectors, p(y|x) recovers the multiclass logistic model. In general, to achieve a tradeoff between computational efficiency and representational flexibility, we make the following assumption on Y:

Assumption 1 (PO-tractable). We assume Y is bounded and admits an efficient polar operator. That is, for any vector d ∈ R^h, min_{y∈Y} d'y is efficiently solvable.

Note the support set Y (hence the base measure q0) is fixed and does not contain any further parameters. PO-tractability is available in a variety of applications, and we give two examples here.

Graph matching. In a bipartite graph with two sets of vertices {ai}_{i=1..n} and {bj}_{j=1..n}, each edge between ai and bj has a weight Tij. The task is to find a one-to-one mapping (this can be extended) between {ai} and {bj} such that the sum of weights on the selected edges is maximized. Denote the matching by Y ∈ {0,1}^{n×n}, where Yij = 1 iff the edge (ai, bj) is selected. So the optimal matching is the mode of p(Y) ∝ ⟦Y ∈ Y⟧ exp(tr(Y'T)), where the support is Y = {Y ∈ {0,1}^{n×n} : Y1 = Y'1 = 1}.

Graphical models. For simplicity, consider a linear chain model V1 − V2 − ··· − Vp. Here each Vi can take one of C possible values, which we encode using the C-dimensional canonical basis vi. Suppose there is a node potential mi ∈ R^C for each Vi, and each edge (Vi, Vi+1) has an edge potential Mi ∈ R^{C×C}. Then we could directly define a distribution on {Vi}. Unfortunately, it would involve quadratic terms such as vi'Mi vi+1, and so a different parameterization is in order. Let Yi ∈ {0,1}^{C×C} encode the values of (Vi, Vi+1) via the row and column indices of Yi respectively. Then the distribution on {Vi} can be equivalently represented by a distribution on {Yi}:

p({Yi}) ∝ ⟦{Yi} ∈ Y⟧ exp( Σ_{i=1..p} mi'Yi1 + Σ_{i=1..p−1} tr(Mi'Yi) ),  (2)

where Y = {{Yi} : Yi ∈ {0,1}^{C×C}} ∩ H,  with H := {{Yi} : 1'Yi1 = 1, Yi'1 = Yi+1 1}.  (3)

The constraints in H encode the obvious consistency constraints between overlapping edges. This model ultimately falls into our framework in (1).

In both examples, the constraints in Y are totally unimodular (TUM), and therefore the polar operator can be computed by solving a linear program (LP), with the {0,1} constraints relaxed to [0,1]. In §4.1 and 4.2 we will generalize y'Ux to y'd(Ux), where d is an affine function of Ux that allows for homogeneity in temporal models. For clarity, we first develop a general framework using y'Ux.

Output layer. As for the output layer, we assume a conditional model from an exponential family

p(z|y) = exp(z'R'y − G(R'y)) q1(z) = exp(−D_{G*}(z ‖ ∇G(R'y)) + G*(z)) q1(z),  (4)

where G is a smooth and strictly convex function, and D_{G*} is the Bregman divergence induced by the Fenchel dual G*. Such a parameterization is justified by the equivalence between regular Bregman divergences and regular exponential families [23]. Thanks to the convexity of G, it is trivial to extend p(z|y) to y ∈ conv Y (the convex hull of Y), and G(R'y) will still be convex over conv Y (fixing R).

Finally we highlight the assumptions we make and do not make. First we only assume PO-tractability of Y, hence tractability of MAP inference in p(y|x). We do not assume it is tractable to compute the normalizer Ω or its gradient (marginal distributions). We also do not assume that unbiased samples of y can be drawn efficiently from p(y|x). In general, PO-tractability is a weaker assumption. For example, in graph matching MAP inference is tractable while marginalization is NP-hard [16] and sampling requires MCMC [24]. Finally, we do not assume tractability of any sort for p(y|x)p(z|y) (in y), and so it may be hard to solve min_{y∈Y} { d'y + G(R'y) − z'R'y }, as G is generally not affine.

2.1 Training principles

At training time, we are provided with a set of feature-label pairs (x, z) ∼ p̃, where p̃ is the empirical distribution. In the special case of the auto-encoder, z is tied with x.
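To make the MAP (polar-operator) oracle concrete for the linear-chain example in (2)–(3), note that the score decomposes along the chain, so the mode can be found by standard dynamic programming. Below is a minimal pure-Python sketch (function names are ours, not from the paper); it maximizes the node-plus-edge score, so negated costs reproduce the min_{y∈Y} d'y convention, and it is checked against brute force.

```python
from itertools import product

def chain_map(node_pot, edge_pot):
    """MAP assignment of a linear chain by dynamic programming (Viterbi).

    node_pot[i][c]   : score of state c at position i          (p lists of length C)
    edge_pot[i][b][c]: score of states (b, c) at (i, i+1)      (p-1 C-by-C tables)
    Returns (best total score, best state sequence).  Scores are maximized.
    """
    p, C = len(node_pot), len(node_pot[0])
    best = [list(node_pot[0])]          # best[i][c]: best score of a prefix ending in c
    back = []                           # back[i-1][c]: best predecessor of state c
    for i in range(1, p):
        prev, cur, arg = best[-1], [], []
        for c in range(C):
            scores = [prev[b] + edge_pot[i - 1][b][c] for b in range(C)]
            b_star = max(range(C), key=scores.__getitem__)
            cur.append(scores[b_star] + node_pot[i][c])
            arg.append(b_star)
        best.append(cur)
        back.append(arg)
    # backtrack from the best final state
    states = [max(range(C), key=best[-1].__getitem__)]
    for arg in reversed(back):
        states.append(arg[states[-1]])
    states.reverse()
    return max(best[-1]), states

def chain_map_brute(node_pot, edge_pot):
    """Exhaustive reference implementation (tiny instances only)."""
    p, C = len(node_pot), len(node_pot[0])
    def score(s):
        return (sum(node_pot[i][s[i]] for i in range(p))
                + sum(edge_pot[i][s[i]][s[i + 1]] for i in range(p - 1)))
    s_star = max(product(range(C), repeat=p), key=score)
    return score(s_star), list(s_star)
```

The dynamic program runs in O(pC²) time, versus O(C^p) for enumeration, which is why the linear chain satisfies Assumption 1.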
The "bootstrapping" style estimation [25] optimizes the joint likelihood with the latent y imputed in an optimistic fashion:

min_{U,R} E_{(x,z)∼p̃} [ min_{y∈Y} − log p(y|x)p(z|y) ] = min_{U,R} E_{(x,z)∼p̃} [ min_{y∈Y} { y'Ux + Ω(Ux) − z'R'y + G(R'y) } ].

This results in a hard EM estimation, and a soft version can be achieved by adding entropic regularizers on y. Regularization can be imposed on U and R, which we will make explicit later (e.g. bounding the L2 norm). Since the log-partition function Ω in p(y|x) is hard to compute, the max-margin approach is introduced, which replaces Ω(Ux) by an upper bound max_{ŷ∈Y} −ŷ'Ux, leading to a surrogate loss

min_{U,R} E_{(x,z)∼p̃} [ min_{y∈Y} { −z'R'y + G(R'y) + y'Ux − min_{ŷ∈Y} ŷ'Ux } ].  (5)

However, the key disadvantage of this method is the augmented inference over y, because we have only assumed the tractability of min_{y∈Y} d'y for all d, not of min_{y∈Y} { y'd + G(R'y) − z'R'y }. In addition, this principle intrinsically determines the latent y as a function of both the input and the output, while at test time the output itself is unknown and is the subject of prediction. The common practice therefore requires a joint optimization over y and z at test time, which is costly in computation.

The goal of this paper is to design a convex formulation in which the latent y is completely determined by the input x, and both prediction and estimation rely only on the polar operator: arg min_{y∈Y} y'Ux.
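For the simplest support set, where Y is the set of h canonical vectors (the multiclass case from §2), this polar operator is just a coordinate argmin. A minimal sketch (function name ours, not from the paper):

```python
def polar_multiclass(U, x):
    """arg min_{y in Y} y'Ux when Y is the set of h canonical vectors:
    the minimizer is the indicator vector of the smallest entry of d = Ux."""
    d = [sum(u_ij * x_j for u_ij, x_j in zip(row, x)) for row in U]
    i_star = min(range(len(d)), key=d.__getitem__)
    return [1.0 if i == i_star else 0.0 for i in range(len(d))]
```

For structured Y (matchings, chains), the same oracle is implemented by combinatorial algorithms instead of a plain argmin, but the interface is identical: a vector d in, an extreme point of Y out.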
As a consequence of this goal, it is natural to postulate that the y found this way renders an accurate prediction of z, or a faithful recovery of x in auto-encoders. This idea, which has been employed by [e.g., 9, 26], leads to the following bi-level optimization problem:

max_{U,R} E_{(x,z)∼p̃} [ log p(z | arg max_{y∈Y} p(y|x)) ]  (6)
⇔ max_{U,R} E_{(x,z)∼p̃} [ log p(z | arg min_{y∈Y} y'Ux) ]
⇔ min_{U,R} E_{(x,z)∼p̃} [ −z'R'y*_x + G(R'y*_x) ],  where y*_x = arg min_{y∈Y} y'Ux.  (7)

Directly solving this optimization problem is challenging, because the optimal y*_x is almost surely invariant to small perturbations of U (e.g. when Y is discrete). So a zero-valued gradient is witnessed almost everywhere. Therefore a more carefully designed optimization algorithm is in demand.

3 A General Framework of Convexification

We propose addressing this bi-level optimization by convex relaxation, built upon the first-order optimality conditions of the inner-level optimization. First notice that the set Y participates in the problem (7) only via the polar operator at Ux: arg min_{y∈Y} y'Ux. If Y is discrete, this problem is equivalent to optimizing over S := conv Y, because a linear function on a convex set is always optimized at an extreme point. Clearly, S is convex, bounded, closed, and PO-tractable. It is important to note that the origin is not necessarily contained in S. To remove the potential non-uniqueness of the minimizer in (7), we next add a small proximal term to the polar operator problem (σ is a small positive number):

min_{w∈S} w'Ux + (σ/2)‖w‖².  (8)

This leads to a small change in the problem and makes sure that the minimizer is unique.¹ Adding strongly convex terms to the primal and dual objectives is a commonly used technique for accelerated optimization [27], and has been used in graphical model inference [e.g., 28]. We intentionally changed the symbol y into w, because here the optimal w is not necessarily in Y.

By the convexity of the problem (8), and noting that the gradient of the objective is Ux + σw, w is optimal if and only if

w ∈ S,  and  (Ux + σw)'(θ̂ − w) ≥ 0  ∀ θ̂ ∈ S.  (9)

These optimality conditions can be plugged into the bi-level optimization problem (7). Introducing "Lagrange multipliers" (γ, θ̂) to enforce the latter condition via a mini-max formulation, we obtain

min_{‖U‖≤1} min_{‖R‖≤1} E_{(x,z)∼p̃} min_w max_{γ≥0, θ̂∈S} max_v [ −z'R'w + v'R'w − G*(v) + ι_S(w) + γ(Ux + σw)'(w − θ̂) ],  (10, 11)

where ι_S is the {0,∞}-valued indicator function of the set S. Here we dualized G via G(R'w) = max_v { v'R'w − G*(v) }, and made explicit the Frobenius norm constraints (‖·‖) on U and R.² Applying the change of variable θ = γθ̂, the constraints γ ≥ 0 and θ̂ ∈ S (a convex set) become

(θ, γ) ∈ N := cone{(θ̂, 1) : θ̂ ∈ S},

where cone stands for the conic hull (convex).
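For concreteness: when S is the probability simplex (the multiclass case), the proximal polar operator (8) reduces to a Euclidean projection, since w'd + (σ/2)‖w‖² = (σ/2)‖w + d/σ‖² up to a constant independent of w, so the unique minimizer is the projection of −d/σ onto S. A minimal sketch using the standard sort-and-threshold simplex projection (function names ours):

```python
def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    via the standard sort-and-threshold algorithm."""
    u = sorted(v, reverse=True)
    css, tau = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        if ui - (css - 1.0) / i > 0:   # condition holds exactly for i = 1..rho
            tau = (css - 1.0) / i
    return [max(vi - tau, 0.0) for vi in v]

def prox_polar_simplex(d, sigma):
    """Unique minimizer of w'd + (sigma/2)||w||^2 over the simplex,
    i.e. the projection of -d/sigma onto the simplex (cf. eq. (8))."""
    return project_simplex([-di / sigma for di in d])
```

For structured S the closed-form projection is not available, but the same completion-of-squares argument explains why adding the proximal term keeps the subproblem tractable whenever L2 projection onto S is.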
Similarly we can dualize ι_S(w) = max_π { π'w − σ_S(π) }, where σ_S(π) := max_{w∈S} π'w is the support function of S. Now swapping min_w with all the subsequent maximizations (strong duality), we arrive at a form where w can be minimized out analytically:

min_{‖U‖≤1} min_{‖R‖≤1} E_{(x,z)∼p̃} max_π max_{(θ,γ)∈N} max_v min_w [ −z'R'w + v'R'w − G*(v) + π'w − σ_S(π) + (Ux + σw)'(γw − θ) ]  (12, 13)

= min_{‖U‖≤1} min_{‖R‖≤1} E_{(x,z)∼p̃} max_π max_{(θ,γ)∈N} max_v [ −G*(v) − σ_S(π) − θ'Ux − (1/(4σγ)) ‖R(v − z) + γUx + π − σθ‖² ].  (14, 15)

Given (U, R), the optimal (v, π, θ, γ) can be efficiently solved for through a concave maximization. However the overall objective is not convex in (U, R), because the quadratic term in (15) is subtracted. Fortunately it turns out not hard to tackle this issue by using a semi-definite programming (SDP) relaxation which linearizes the quadratic terms. In particular, let I be the identity matrix, and define

M := M(U, R) := (I, U, R)'(I, U, R) = [ I, U, R ; U', U'U, U'R ; R', R'U, R'R ] =: [ M1, Mu, Mr ; Mu', Mu,u, M'_{r,u} ; Mr', Mr,u, Mr,r ],  (16)

where the rows of the 3-by-3 block matrix are separated by semicolons. Then θ'Ux can be replaced by θ'Mu x, and the quadratic term in (15) can be expanded as

f(M, π, θ, γ, v; x, z) := tr(Mr,r (v−z)(v−z)') + γ² tr(Mu,u xx') + 2γ tr(Mr,u x(v−z)') + 2(π − σθ)'(Mr(v−z) + γMu x) + ‖π − σθ‖².  (17)

Since, given (π, θ, γ, v), the objective function is linear in M, after maximizing out these variables the overall objective is convex in M. Although this change of variable renders the objective convex, it shifts the intractability into the feasible region of M:

M0 := {M ⪰ 0 : M1 = I, tr(Mu,u) ≤ 1, tr(Mr,r) ≤ 1} ∩ {M : rank(M) = h}.  (18)

¹If p(y|x) ∝ p0(y) exp(−y'Ux − (σ/2)‖y‖²) (for any σ > 0), then there is no need to add this (σ/2)‖w‖² term. In this case, all our subsequent developments apply directly. Therefore our approach applies to a broader setting where L2 projection onto S is tractable, but here we focus on PO-tractability just for the clarity of presentation.
²To simplify the presentation, we bound the radius by 1, while in practice it is a hyperparameter to be tuned.

Here M ⪰ 0 means M is real symmetric and positive semi-definite. Due to the rank constraint, M0 is not convex. So a natural relaxation—the only relaxation we introduce besides the proximal term in (8)—is to drop this rank constraint and optimize with the resulting convex set M1.
This leads to the final convex formulation:

min_{M∈M1} E_{(x,z)∼p̃} max_π max_{(θ,γ)∈N} max_v [ −G*(v) − σ_S(π) − θ'Mu x − (1/(4σγ)) f(M, π, θ, γ, v; x, z) ].  (19)

To summarize, we have achieved a convex model for two-layer conditional models in which the latent structured representation is determined by a polar operator. Instead of bypassing this bi-level optimization via the usual loss-based approach [e.g., 19, 29], we addressed it directly by leveraging the optimality conditions of the inner optimization. A convex relaxation is then achieved via SDP.

3.1 Inducing low-rank solutions of the relaxation

Although it is generally hard to provide a theoretical guarantee for nonlinear SDP relaxations, it is interesting to note that the constraint set M1 effectively encourages low-rank solutions (hence tighter relaxations). As a key technical result, we next show that all extreme points of M1 have rank h (the number of hidden nodes) for all h ≥ 2. Recall that in sparse coding, the atomic norm framework [30] induces low-complexity solutions by setting up the optimization over the convex hull of atoms, or by penalizing via its gauge function. Therefore the characterization of the extreme points of M1 may open up the possibility of analyzing our relaxation by leveraging results from sparse coding.

Lemma 1. Let Ai be symmetric matrices, and consider the set

R := {X : X ⪰ 0, tr(Ai X) ⋚ bi, i = 1, ..., m},  (20)

where m is the number of linear (in)equality constraints and ⋚ can be any one of ≤, =, or ≥. Then the rank r of every extreme point of R is upper bounded by

r ≤ ⌊(√(8m + 1) − 1)/2⌋.  (21)

This result extends [31] by accommodating inequalities in (20), and its proof is given in Appendix A. Now we show that the feasible region M1 defined by (18) has all extreme points of rank h.

Theorem 1. If h ≥ 2, then all extreme points of M1 have rank h, and M1 is the convex hull of M0.

Proof. Let M be an extreme point of M1. Noting that M ⪰ 0 already encodes the symmetry of M, the linear constraints for M1 in (18) can be written as h(h+1)/2 linear equality constraints and two linear inequality constraints. In total m = h(h+1)/2 + 2. Plugging this into (21) in the above lemma,

rank(M) ≤ ⌊(√(8m + 1) − 1)/2⌋ = ⌊(√(4h(h+1) + 17) − 1)/2⌋ = h + ⟦h = 1⟧.  (22)

Finally, the identity matrix in the top-left corner of M forces rank(M) ≥ h. So rank(M) = h for all h ≥ 2. It then follows that M1 = conv M0.

4 Application in Machine Learning Problems

The framework developed above is generic. For example, when Y represents classification over h classes by canonical vectors, S = conv Y is the h-dimensional probability simplex (entries summing to 1). Clearly σ_S(π) = max_i π_i, and N = {(x, t) ∈ R^{h+1}_+ : 1'x = t}. In many applications, Y can be characterized as {y ∈ {0,1}^h : Ay ≤ c}, where A is TUM and all entries of c are in {−1, 0, 1}.³ In this case, all extreme points of the convex hull S are integral, and S admits the explicit form

Y = {y ∈ {0,1}^h : Ay ≤ c}  ⟹  S = conv Y = {w ∈ [0,1]^h : Aw ≤ c},  (23)

obtained by replacing all binary constraints {0,1} with the interval [0,1]. Clearly TUM is a sufficient condition for PO-tractability, because min_{y∈Y} d'y is equivalent to min_{w∈S} d'w, an LP.
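As a sanity check of this equivalence, the polar operator for the matching support set of §2 can be computed by brute force over permutations for tiny n; by total unimodularity, the LP relaxation over doubly stochastic matrices attains the same integral optimum. A minimal sketch (helper names ours, not from the paper):

```python
from itertools import permutations

def matching_polar(D):
    """Polar operator for the matching support set: minimize
    sum_ij D[i][j] * Y[i][j] over n-by-n permutation matrices Y, by
    brute force (tiny n only).  Because the constraint matrix is TUM,
    the LP relaxation over doubly stochastic matrices has the same
    integral optimum, so this also solves min_{w in S} d'w."""
    n = len(D)
    val_star, p_star = min(
        (sum(D[i][p[i]] for i in range(n)), p)
        for p in permutations(range(n)))
    Y = [[1 if p_star[i] == j else 0 for j in range(n)] for i in range(n)]
    return val_star, Y
```

In practice one would replace the enumeration with a polynomial-time assignment solver or a generic LP over the relaxed constraints; the brute-force version only illustrates the oracle's input-output contract.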
Examples include the above graph matching and linear chain models. We will refer to Aw ≤ c as the non-box constraints.

4.1 Graph matching

As the first concrete example, we consider convex relaxation for latent graph matching. One task in natural language processing is transliteration [12, 32]. Suppose we are given an English word e with m letters and a corresponding Hebrew word h with n letters. The goal is to predict whether e and h are phonetically similar, a binary classification problem with z ∈ {−1, 1}. However, it obviously helps to find, as an intermediate step, the letter-wise matching between e and h. The underlying assumption is that each letter corresponds to at most one letter in the word of the other language. So if we augment both e and h with a sink symbol * at the end (hence making their lengths m̂ := m + 1 and n̂ := n + 1 respectively), we would like to find a matching y ∈ {0,1}^{m̂ n̂} that minimizes the following cost:

min_{Y∈Y} Σ_{i=1..m̂} Σ_{j=1..n̂} Yij u'φij,  where  Y = {0,1}^{m̂×n̂} ∩ {Y : Yi,:1 = 1 ∀i ≤ m, 1'Y:,j = 1 ∀j ≤ n} =: {0,1}^{m̂×n̂} ∩ G.  (24)

Here Yi,: is the i-th row of Y. φij ∈ R^p is a feature vector associated with the pair of the i-th letter in e and the j-th letter in h, including the dummy *. Our notation omits its dependency on e and h. u is a discriminative weight vector that will be learned from data. After finding the optimal Y*, [12] uses the maximal objective value of (24) to make the final binary prediction: −sign(Σ_{ij} Y*ij u'φij).

To pose the problem in our framework, we first notice that the non-box constraints G in (24) are TUM. Therefore, S is simply [0,1]^{m̂×n̂} ∩ G. Given the decoded w, the output labeling principle above essentially duplicates u as the output-layer weight. A key advantage of our method is to allow the weights of the two layers to be decoupled. Using a weight vector r ∈ R^p, we define the output score as r'Φw, where Φ is a p-by-m̂n̂ matrix whose (i,j)-th column is φij. So Φ depends on e and h. Overall, our model follows by instantiating (12) as:

min_{‖u‖≤1} min_{‖r‖≤1} E_{(e,h,z)∼p̃} max_π max_{(θ,γ)∈N} max_{v∈R} min_w [ −zr'Φw + vr'Φw − G*(v) + π'w − σ_S(π) + Σ_{ij} (u'φij + σwij)(γwij − θij) ].  (25, 26)

Once more we can minimize out w, which gives rise to a quadratic term ‖(v − z)Φ'r + γΦ'u + π − σθ‖². It is again amenable to SDP relaxation, where (Mu,u, Mr,u, Mr,r) correspond to (uu', ru', rr') respectively.

³For simplicity, we write equality constraints (handled separately in practice) using two inequality constraints.

4.2 Homogeneous temporal models

A variety of structured output problems are formulated with graphical models. We highlight the gist of our technique using a concrete example: unsupervised structured learning for inpainting. Suppose we are given images of handwritten words, each segmented into p letters, and the latent representation is the corresponding letters.
Since letters are correlated in their appearance in words, the recognition problem has long been addressed using linear chain conditional random fields. However, imagine no ground-truth letter label is available, and instead of predicting labels, we are given images in which a random small patch is occluded. So our goal will be inpainting the patches.

To cast the problem in our two-layer latent structure model, let each letter image in the word be denoted by a vector xi ∈ R^n, and the reconstructed image by zi ∈ R^m (m = n here). Let Yi ∈ {0,1}^{h×h} (h = 26) encode the labels of the letter pair at positions i and i+1 (as rows and columns of Yi respectively). Let Uv ∈ R^{h×n} be the letter-wise discriminative weights, and Ue ∈ R^{h×h} the pairwise weights. Then by (2), the MAP inference can be reformulated as (cf. the definition of H in (3))

min_{{Yi}∈Y} Σ_{i=1..p} 1'Yi'Uv xi + Σ_{i=1..p−1} tr(Ue'Yi),  where  Y = {{Yi} : Yi ∈ {0,1}^{C×C}} ∩ H.  (27)

Since the non-box constraints in H are TUM, the problem can be cast in our framework with S = conv Y = {{Yi} : Yi ∈ [0,1]^{C×C}} ∩ H. Finally, to reconstruct the image for each letter, we assume that each letter j has a basis vector rj ∈ R^m. So given Wi, the output of reconstruction is R'Wi1, where R = (r1, ..., rh)'. To summarize, our model can be instantiated from (12) as

min_{‖U‖≤1} min_{‖R‖≤1} E_{(x,z)∼p̃} max_Π max_{(Θ,γ)∈N} max_v min_W [ Σ_{i=1..p} { (vi − zi)'R'Wi1 − G*(vi) } + tr(Π'W) − σ_S(Π) + Σ_{i=1..p} tr((Uv xi1' + ⟦i ≠ p⟧ Ue + σWi)'(γWi − Θi)) ].  (28)

Here zi is the inpainted image in the training set. If no training image is occluded, then just set zi to xi. The constraints on U and R can be refined, e.g. bounding ‖Uv‖, ‖Ue‖, and ‖rj‖ separately. As before, we can derive a quadratic term ‖R(vi − zi)1' + γUv xi1' + γUe + Πi − σΘi‖² by minimizing out Wi, which again leads to SDP relaxations. Even further, we may allow each letter to employ a set of principal components whose combination yields the reconstruction (Appendix B).

Besides modeling flexibility, our method also accommodates problem-specific simplifications. For example, the dimension of w is often much higher than the number of non-box constraints. Appendix C shows that for the linear chain, the dimension of w can be reduced from C² to C via a partial Lagrangian.

5 Optimization

The key advantage of our convex relaxation (19) is that the inference depends on S (or equivalently Y) only through the polar operator. Our overall optimization scheme is to perform projected SGD over the function of M. This requires: (a) given M, computing its objective value and gradient; and (b) projecting onto M1. We next detail the solution to the former, relegating the latter to Appendix D.

Given M, we optimize over (π, θ, γ, v) by projected LBFGS [33]. The objective is easy to compute thanks to PO-tractability (for the σ_S(π) term). The only nontrivial part is to project a point (θ0, γ0) onto N, which is actually amenable to conditional gradient (CG). Formally it requires solving

min_{θ,γ} (1/2)‖θ − θ0‖² + (1/2)(γ − γ0)²,  s.t. θ = γs, γ ∈ [0, C], s ∈ S.  (29)

W.l.o.g., we manually introduced an upper bound⁴ C := γ0 + √(‖θ0‖² + γ0²) on γ.
At each iteration, CG queries the gradients g_θ in θ and g_γ in γ, and solves the polar operator problem on N:

  min_{θ∈γS, γ∈[0,C]} θ′g_θ + γg_γ = min_{s∈S, γ∈[0,C]} γs′g_θ + γg_γ = min{0, C min_{s∈S}(s′g_θ + g_γ)}.  (30)

So it boils down to the polar operator on S, and is hence tractable. If the optimal value in (30) is nonnegative, then the current iterate is already optimal. Otherwise we add a basis (s*, 1) to the ensemble, and a totally corrective update can be performed by CG. More details are available in [34]. After finding the optimal M̂, we recover the optimal w for each training example based on the optimal w in (12). Using it as the initial point, we locally optimize the two-layer models U and R based on (14).

6 Experimental Results
To empirically evaluate our convex method (henceforth referred to as CVX), we compared it with state-of-the-art methods on two prediction problems with latent structure.
Transliteration  The first experiment is based on the English-Hebrew corpus [35]. It consists of 250 positive transliteration pairs for training and 300 pairs for testing; on average there are 6 characters per word in each language. All these pairs are considered "positive examples", and for negative examples we followed [12] and randomly sampled t⁻ ∈ {50, 75, 100} pairs from the 250² − 250 mismatched pairings (i.e., 20%, 30%, and 40% of 250, respectively).
We did not use many negative examples because, as per [12], our test performance measure depends mainly on the highest few discriminative values, which are learned largely from the positive examples.
Given a pair of words (e, h), the feature representation φᵢⱼ for the i-th letter in e and the j-th letter in h is defined as the unigram feature: an n-dimensional vector with all 0's except a single one in the (eᵢ, hⱼ)-th coordinate. In this dataset, there are n = 655 possible letter pairs (* included). Since our primary objective is to determine whether the convex relaxation of a two-layer model with latent structure can outperform locally trained models, we adopted this simple but effective feature representation (rather than delving into heuristic feature engineering).
Our test evaluation metric is the Mean Reciprocal Rank (MRR), the average of the reciprocal of the rank of the correct answer. In particular, for each English word e, we calculated the discriminative score of the respective methods when e is paired with each Hebrew word in the test set, and then found the rank of the correct word (1 for the highest). The reciprocal of the rank is averaged over all test pairs, giving the MRR. A higher value is thus preferred, and 50% means the true Hebrew word is on average the runner-up. For our method, the discriminative score is simply f := r′Φw (using the symbols in (25)), and that for [12] is f := max_{Y∈Y} u′Φ vec(Y) (the vectorization of Y).
We compared our method (with σ = 0.1) against the state-of-the-art approach of [12], which is a special case of our model with the second-layer weight r tied to the first-layer weight u. They trained it using a local optimization method, so we refer to it as Local. Both methods employ the output loss max{0, yf}² with y ∈ {+1, −1}, and both contain only one parameter, namely the bound on ‖u‖ (and ‖r‖).
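The MRR protocol described above is straightforward to compute. A minimal sketch follows; the `score` function and the data containers are generic placeholders for f, not the paper's code:

```python
def mean_reciprocal_rank(score, english, hebrew, correct):
    """MRR as described above: for each English word, rank all Hebrew
    candidates by discriminative score (1 = highest) and average the
    reciprocal rank of the true pairing.

    score(e, h) -> float; english/hebrew: lists of words;
    correct: dict mapping each English word to its true Hebrew word.
    """
    total = 0.0
    for e in english:
        ranked = sorted(hebrew, key=lambda h: score(e, h), reverse=True)
        rank = ranked.index(correct[e]) + 1  # 1-based rank of the true word
        total += 1.0 / rank
    return total / len(english)
```

With this definition, a method whose true answer is always ranked second scores exactly 0.5, matching the "runner-up" reading of 50% MRR above.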
We tuned this bound to optimize the performance of Local. The test MRR is shown in Figure 1, where the number of negative examples was varied over 50, 75, and 100. Local was trained with random initialization, and we repeated the random selection of the negative examples 10 times, yielding 10 dots in each scatter plot. Clearly CVX in general delivers significantly higher MRR than Local, with the dots lying above or close to the diagonal. Since this dataset is not big, the randomness of the negative set leads to notable variation in the performance of both methods.

⁴For γ to be optimal, we require (γ − γ₀)² ≤ ‖θ − θ₀‖² + (γ − γ₀)² ≤ ‖0 − θ₀‖² + (0 − γ₀)², i.e., γ ≤ C.

Figure 1: MRR of Local versus CVX over (a) 50, (b) 75, and (c) 100 negative examples.

Table 1: Total inpainting error as a function of the size of the occluded patch (p = 8).

           k = 2        k = 3        k = 4
CRF-AE     0.29±0.01    0.80±0.01    1.31±0.02
CVX        0.27±0.01    0.79±0.01    1.28±0.02

Table 2: Total inpainting error as a function of the length of sequences (k = 4).

           p = 4        p = 6        p = 8
CRF-AE     1.33±0.04    1.30±0.02    1.31±0.03
CVX        1.29±0.04    1.27±0.02    1.28±0.03

Inpainting for occluded images  Our second experiment used the structured latent model to inpaint images. We generated 200 sequences of images for training, each with p ∈ {4, 6, 8} digits. In order to introduce structure, each sequence is either odd (all digits are 1 or 3) or even (all digits are 2 or 4), so C = 4. Given the digit label, the corresponding image (x ∈ [0, 1]¹⁹⁶) was sampled from the MNIST dataset, downsampled to 14-by-14.
200 test sequences were also generated. In the test data, we randomly set a k × k patch of each image to 0 as occluded (k ∈ {2, 3, 4}), and the task is to inpaint it. This setting is entirely unsupervised, with no digit label available for training. It falls in the framework of X → Y → Z, where X is the occluded input, Y is the latent digit sequence, and Z is the recovered image. In our convex method, we tied U_v with R, so we still have a 3-by-3 block matrix M, corresponding to I, U_v, and U_e. We set σ to 10⁻¹ and G(·) = ½‖·‖² (Gaussian). Y was predicted using the polar operator, based on which Z was predicted with the Gaussian mean.
For comparison, we used CRF-AE, which was proposed very recently by [7]. Although it ties X and Z, the extension to our setting is trivial by computing the expected value of Z given X. Here P(Z|Y) is assumed to be Gaussian, whose mean is learned by maximizing P(Z = x|X = x), and we initialized all model parameters with the unit Gaussian. For ease of comparison, we introduced regularization by constraining model parameters to L2-norm balls rather than penalizing the squared L2 norm. For both methods, the radius bound was simply chosen as the maximum L2 norm of the images, which produced consistently good results. We did not use higher k because the images are sized 14-by-14.
The inpainting error of the two methods is shown in Table 1, where we varied the size of the occluded patch with p fixed to 8, and in Table 2, where the length of the sequence p was varied while k was fixed to 4. Each number is the sum of squared error in the occluded patch, averaged over 5 random generations of training and test data (hence producing the mean and standard deviation). Here we can see that CVX gives lower error than CRF-AE. Unsurprisingly, the error grows almost quadratically in k.
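The evaluation protocol above, zeroing out a k × k patch and measuring the sum of squared error restricted to that patch, can be sketched as follows; the shapes follow the 14-by-14 setup, and the function names are ours, not the paper's:

```python
import numpy as np

def occlude(image, top, left, k):
    """Zero out a k-by-k patch of an image, returning a copy."""
    occluded = image.copy()
    occluded[top:top + k, left:left + k] = 0.0
    return occluded

def patch_error(original, reconstructed, top, left, k):
    """Sum of squared error restricted to the occluded patch."""
    diff = (original[top:top + k, left:left + k]
            - reconstructed[top:top + k, left:left + k])
    return float((diff ** 2).sum())
```

Since the error is summed over k² pixels, near-quadratic growth in k is the expected baseline behavior when the per-pixel error stays roughly constant.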
When the length of the sequence grows, the error of both CVX and CRF-AE fluctuates nonmonotonically. This is probably because with more images in each sequence the total error is summed over more images, while the error per image decays thanks to the structure.

7 Conclusion

We have presented a new formulation of two-layer models with latent structure, while maintaining a jointly convex training objective. Its effectiveness is demonstrated by superior empirical performance over local training, along with a low-rank characterization of the extreme points of the feasible region. An interesting extension for future investigation is when the latent layer employs submodularity, with its base polytope mirroring the support set S.

References
[1] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
[2] L.-C. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In ICML. 2015.
[3] V. Mnih, H. Larochelle, and G. E. Hinton. Conditional restricted Boltzmann machines for structured output prediction. In AISTATS. 2011.
[4] M. Ratajczak, S. Tschiatschek, and F. Pernkopf. Sum-product networks for structured prediction: Context-specific deep conditional random fields. In Workshop on Learning Tractable Probabilistic Models. 2014.
[5] K. Sohn, X. Yan, and H. Lee. Learning structured output representation using deep conditional generative models. In NIPS. 2015.
[6] R. Collobert. Deep learning for efficient discriminative parsing. In ICML. 2011.
[7] W. Ammar, C. Dyer, and N. A. Smith. Conditional random field autoencoders for unsupervised structured prediction. In NIPS. 2014.
[8] L. Xu, D. Wilkinson, F. Southey, and D. Schuurmans. Discriminative unsupervised learning of structured predictors. In ICML.
2006.
[9] H. Daumé III. Unsupervised search-based structured prediction. In ICML. 2009.
[10] N. Smith and J. Eisner. Contrastive estimation: training log-linear models on unlabeled data. In ACL. 2005.
[11] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[12] M.-W. Chang, D. Goldwasser, D. Roth, and V. Srikumar. Discriminative learning over constrained latent representations. In NAACL. 2010.
[13] M.-W. Chang, V. Srikumar, D. Goldwasser, and D. Roth. Structured output learning with indirect supervision. In ICML. 2010.
[14] N. Chen, J. Zhu, F. Sun, and E. P. Xing. Large-margin predictive latent subspace learning for multiview data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(12):2365–2378, 2012.
[15] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In ICML. 2008.
[16] T. Hazan and T. Jaakkola. On the partition function and random maximum a-posteriori perturbations. In ICML. 2012.
[17] A. Gane, T. Hazan, and T. Jaakkola. Learning with random maximum a-posteriori perturbation models. In AISTATS. 2014.
[18] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In ICML. 2005.
[19] O. Aslan, X. Zhang, and D. Schuurmans. Convex deep learning via normalized kernels. In NIPS. 2014.
[20] Y. Bengio, N. L. Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In NIPS. 2005.
[21] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In ICML. 2014.
[22] R. Livni, S. Shalev-Shwartz, and O. Shamir. An algorithm for training polynomial networks, 2014. ArXiv:1304.7045v2.
[23] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences.
Journal of Machine Learning Research, 6:1705–1749, 2005.
[24] A. Gotovos, H. Hassani, and A. Krause. Sampling from probabilistic submodular models. In NIPS. 2015.
[25] G. Haffari and A. Sarkar. Analysis of semi-supervised learning with Yarowsky algorithm. In UAI. 2007.
[26] L. Xu, M. White, and D. Schuurmans. Optimal reverse prediction: a unified perspective on supervised, unsupervised and semi-supervised learning. In ICML. 2009.
[27] Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127–152, 2005.
[28] O. Meshi, M. Mahdavi, and A. G. Schwing. Smooth and strong: MAP inference with linear convergence. In NIPS. 2015.
[29] G. Druck, C. Pal, X. Zhu, and A. McCallum. Semi-supervised classification with hybrid generative/discriminative methods. In KDD. 2007.
[30] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.
[31] G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998.
[32] D. Goldwasser and D. Roth. Transliteration as constrained optimization. In EMNLP. 2008.
[33] http://www.cs.ubc.ca/~schmidtm/Software/minConf.html
[34] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML. 2013.
[35] https://cogcomp.cs.illinois.edu/page/resource_view/2