{"title": "Structured Learning with Approximate Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 785, "page_last": 792, "abstract": null, "full_text": "Structured Learning with Approximate Inference\n\nAlex Kulesza and Fernando Pereira\u2217\n\nDepartment of Computer and Information Science\n{kulesza, pereira}@cis.upenn.edu\n\nUniversity of Pennsylvania\n\nAbstract\n\nIn many structured prediction problems, the highest-scoring labeling is hard to\ncompute exactly, leading to the use of approximate inference methods. However,\nwhen inference is used in a learning algorithm, a good approximation of the score\nmay not be suf\ufb01cient. We show in particular that learning can fail even with an\napproximate inference method with rigorous approximation guarantees. There are\ntwo reasons for this. First, approximate methods can effectively reduce the expres-\nsivity of an underlying model by making it impossible to choose parameters that\nreliably give good predictions. Second, approximations can respond to parameter\nchanges in such a way that standard learning algorithms are misled. In contrast, we\ngive two positive results in the form of learning bounds for the use of LP-relaxed\ninference in structured perceptron and empirical risk minimization settings. We\nargue that without understanding combinations of inference and learning, such as\nthese, that are appropriately compatible, learning performance under approximate\ninference cannot be guaranteed.\n\n1 Introduction\n\nStructured prediction models commonly involve complex inference problems for which \ufb01nding ex-\nact solutions is intractable [1]. There are two ways to address this dif\ufb01culty. Directly, models used in\npractice can be restricted to those for which inference is feasible, such as conditional random \ufb01elds\non trees [2] or associative Markov networks with binary labels [3]. 
More generally, however, efficient but approximate inference procedures have been devised that apply to a wide range of models, including loopy belief propagation [4, 5], tree-reweighted message passing [6], and linear programming relaxations [7, 3], all of which give efficient approximate predictions for graphical models of arbitrary structure.

Since some form of inference is the dominant subroutine in all structured learning algorithms, it is natural to see good approximate inference techniques as solutions to the problem of tractable learning as well. A number of authors have taken this approach, using inference approximations as drop-in replacements during training, often with empirical success [3, 8]. And yet there has been little theoretical analysis of the relationship between approximate inference and reliable learning.

We demonstrate with two counterexamples that the characteristics of approximate inference algorithms relevant for learning can be distinct from those, such as approximation guarantees, that make them appropriate for prediction. First, we show that approximations can reduce the expressivity of a model, making previously simple concepts impossible to implement and hence to learn, even though inference meets an approximation guarantee. Second, we show that standard learning algorithms can be led astray by inexact inference, failing to find valid model parameters. 
It is therefore crucial to choose compatible inference and learning procedures.

*This work is based on research supported by NSF ITR IIS 0428193.

With these considerations in mind, we prove that LP-relaxation-based approximate inference procedures are compatible with the structured perceptron [9] as well as with empirical risk minimization under a margin criterion using the PAC-Bayes framework [10, 11].

2 Setting

Given a scoring model S(y|x) over candidate labelings y for input x, exact Viterbi inference is the computation of the optimal labeling

h(x) = arg max_y S(y|x) .  (1)

In a prediction setting, the goal of approximate inference is to compute efficiently a prediction with the highest possible score. However, in learning, a tight relationship between the scoring model and true utility cannot be assumed; after all, learning seeks to find such a relationship. Instead, we assume a fixed loss function L(y|x) that measures the true cost of predicting y given x, a distribution D over inputs x, and a parameterized scoring model Sθ(y|x) with associated optimal labeling function hθ and inference algorithm Aθ. Exact inference implies Aθ = hθ. Learning seeks the risk minimizer

θ* = arg min_θ E_{x∼D} [L(Aθ(x)|x)] .  (2)

Successful learning, then, requires two things: the existence of θ for which risk is suitably low, and the ability to find such θ efficiently. In this work we consider the impact of approximate inference on both criteria. We model our examples as pairwise Markov random fields (MRFs) defined over a graph G = (V, E) with probabilistic scoring model

P(y|x) ∝ ∏_{i∈V} ψi(yi|x) ∏_{ij∈E} ψij(yi, yj|x) ,  (3)

where ψi(yi|x) and ψij(yi, yj|x) are positive potentials. 
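For intuition, Eqs. (1) and (3) can be combined into a brute-force MAP routine for a tiny MRF. The following sketch uses made-up potentials (not from the paper) on a three-node chain; it is only illustrative, since enumeration is exponential in the number of nodes:

```python
import itertools
import math

def map_inference(labels, node_pot, edge_pot, edges):
    """Exact MAP inference (Eq. 1) by brute-force enumeration of labelings,
    scoring each by the product of node and edge potentials (Eq. 3)."""
    best, best_score = None, -math.inf
    for y in itertools.product(*labels):
        score = 1.0
        for i, yi in enumerate(y):
            score *= node_pot[i][yi]
        for i, j in edges:
            score *= edge_pot[i, j][y[i], y[j]]
        if score > best_score:
            best, best_score = y, score
    return best, best_score

# A toy 3-node chain with binary labels and hypothetical potentials.
labels = [(0, 1)] * 3
node_pot = [{0: 1.0, 1: 2.0}, {0: 1.0, 1: 1.5}, {0: 2.0, 1: 1.0}]
edges = [(0, 1), (1, 2)]
# Associative edge potentials: weight e for agreement, 1 otherwise.
edge_pot = {e: {(a, b): math.e if a == b else 1.0
                for a in (0, 1) for b in (0, 1)} for e in edges}

y_map, score = map_inference(labels, node_pot, edge_pot, edges)
# Agreement outweighs node 2's preference for label 0, so y_map is (1, 1, 1).
```

Here the associative edge potentials dominate the one dissenting node potential, which previews the role strong edge potentials play in the counterexamples that follow.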
For learning, we use log-linear potentials ψi(yi|x) = exp(w · f(x, yi)), assuming a feature function f(·) and parameter vector w. Since MRFs are probabilistic, we also refer to Viterbi inference as maximum a posteriori (MAP) inference.

3 Algorithmic separability

The existence of suitable model parameters θ is captured by the standard notion of separability.

Definition 1. A distribution D (which can be empirical) is separable with respect to a model Sθ(y|x) and loss L(y|x) if there exists θ such that E_{x∼D} [L(hθ(x)|x)] = 0.¹

However, approximate inference may not be able to match exactly the separating hypothesis hθ. We need a notion of separability that takes into account the (approximate) inference algorithm.

Definition 2. A distribution D is algorithmically separable with respect to a parameterized inference algorithm Aθ and loss L(y|x) if there exists θ such that E_{x∼D} [L(Aθ(x)|x)] = 0.

While separability characterizes data distributions with respect to models, algorithmic separability characterizes data distributions with respect to inference algorithms. Note that algorithmic separability is more general than standard separability for any decidable model, since we can design an (inefficient) algorithm Aθ(x) = hθ(x).² However, we show by counterexample that even algorithms with provable approximation guarantees can make separable problems algorithmically inseparable.

3.1 LP-relaxed inference

Consider the simple Markov random field pictured in Figure 1, a triangle in which each node has as its set of allowed labels a different pair of the three possible labels A, B, and C. Let the node potentials ψi(yi) be fixed to 1 so that labeling preferences derive only from edge potentials. 
¹Separability can be weakened to allow nonzero risk, but for simplicity we focus on the strict case.
²Note further that algorithmic separability supports inference algorithms that are not based on any abstract model at all; such algorithms can describe arbitrary "black box" functions from parameters to predictions. It seems unlikely, however, that such algorithms are of much use, since their parameters cannot be easily learned.

For positive constants λij, define edge potentials ψij(yi, yj) = exp(λij) whenever yi = yj and ψij(yi, yj) = 1 otherwise. Then the joint probability of a configuration y = (y1, y2, y3) is given by

P(y) ∝ ∏_{ij: yi=yj} exp(λij) = exp( Σ_{ij} I(yi = yj) λij ) ,  (4)

and the MAP labeling is arg max_y Σ_{ij} I(yi = yj) λij.

Note that this example is associative; that is, neighboring nodes are encouraged to take identical labels (λij > 0). We can therefore perform approximate inference using a linear programming (LP) relaxation and get a multiplicative approximation guarantee [3]. We begin by writing an integer program for computing the MAP labeling; below, μi(yi) indicates node i taking label yi (which ranges over the two allowed labels for node i) and μij(yi, yj) indicates nodes i and j taking labels yi and yj, respectively.

max_μ  λ12 μ12(B, B) + λ23 μ23(C, C) + λ31 μ31(A, A)
s.t.   Σ_{yi} μi(yi) ≤ 1  ∀i
       μij(yi, yj) ≤ μi(yi)  ∀ij, yi, yj
       μ ∈ {0, 1}^{dim(μ)}

Figure 1: A simple MRF. Each node is annotated with its allowed labels.

Integer programming is NP-hard, so we use an LP relaxation, replacing the integrality constraint with μ ≥ 0. 
Letting i*j* = arg max_{ij} λij, it is easy to see that the correct MAP configuration assigns matching labels to nodes i* and j* and an arbitrary label to the third. The score for this configuration is λ_{i*j*}. However, the LP relaxation may generate fractional solutions. In particular, whenever (λ12 + λ23 + λ31)/2 > λ_{i*j*}, the configuration that assigns to every node both of its allowed labels in equal proportion (μ = 1/2) is optimal.

The fractional labeling μ = 1/2 is the most uninformative possible; it suggests that all labelings are equally valid. Even so, (λ12 + λ23 + λ31)/2 ≤ 3λ_{i*j*}/2 by the definition of i*j*, so LP-relaxed inference for this MRF has a relatively good approximation ratio of 3/2.

3.2 Learning with LP-relaxed inference

Suppose now that we wish to learn to predict labelings y from instances of the MRF in Figure 1 with positive features given by x = (x12, x23, x31). We parameterize the model using a positive weight vector w = (w12, w23, w31), letting λij = wij xij.

Suppose the data distribution gives equal probability to inputs x = (4, 3, 3), (3, 4, 3), and (3, 3, 4), and that the loss function is defined as follows. Given x, let i*j* = arg max_{ij} xij. Then assigning matching labels to nodes i* and j* and an arbitrary label to the third node yields a 0-loss configuration. All other configurations have positive loss. It is clear, first of all, that this problem is separable; if w = (1, 1, 1), then λij = xij and the solution to the integer program above coincides with the labeling rule. Furthermore, there is margin: any weight vector in a neighborhood of (1, 1, 1) assigns the highest probability to the correct labeling.

Using LP-relaxed inference, however, the problem is impossible to learn. 
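This failure is easy to check numerically. The following sketch (hypothetical code, not from the paper) samples weight vectors satisfying the necessary separation constraints 4wij > 3wkl for every pair of edges, and on the instance x = (4, 3, 3) compares the best integral labeling against the fractional point μ = 1/2:

```python
import random

random.seed(0)

# Instance x = (4, 3, 3): its 0-loss labeling matches nodes on edge 12.
x = (4.0, 3.0, 3.0)

fractional_always_wins = True
for _ in range(1000):
    # Sample a positive weight vector satisfying the necessary separation
    # constraints 4*w_ij > 3*w_kl for every pair of edges,
    # equivalently 4*min(w) > 3*max(w).
    while True:
        w = [random.uniform(0.1, 10.0) for _ in range(3)]
        if 4 * min(w) > 3 * max(w):
            break
    lam = [wi * xi for wi, xi in zip(w, x)]   # edge scores lambda_ij = w_ij * x_ij
    integral_best = max(lam)                  # best integral labeling scores max_ij lambda_ij
    fractional = sum(lam) / 2.0               # the point mu = 1/2 scores (sum_ij lambda_ij)/2
    if fractional <= integral_best:
        fractional_always_wins = False
```

Every sampled weight vector yields `fractional > integral_best`, so the relaxation returns the uninformative μ = 1/2 on every instance in the separating region, matching the derivation that follows.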
In order to correctly label the instance x = (4, 3, 3) we must have, at a minimum, λ12 > λ23, λ31 (equivalently 4w12 > 3w23, 3w31), since the 0-loss labeling must have a higher objective score than any other labeling. Reasoning similarly for the remaining instances, any separating weight vector must satisfy 4wij > 3wkl for each pair of edges (ij, kl). Without loss of generality, assume an instance to be labeled has feature vector x = (4, 3, 3). Then

(1/2)(λ12 + λ23 + λ31) = (1/2)(4w12 + 3w23 + 3w31)
                       > (1/2)(4w12 + 3 · (3/4)w12 + 3 · (3/4)w12)
                       = (17/4)w12
                       > 4w12
                       = λ12 .

As a result, LP-relaxed inference predicts μ = 1/2. The data cannot be correctly labeled using an LP relaxation with any choice of weight vector, and the example is therefore algorithmically inseparable.

4 Insufficiency of algorithmic separability

We cannot expect to learn without algorithmic separability; no amount of training can hope to be successful when there simply do not exist acceptable model parameters. Nevertheless, in that case we could draw upon the usual techniques for dealing with (geometric) inseparability.

Approximate inference introduces another complication, however. Learning techniques exploit assumptions about the underlying model to search parameter space; the perceptron, for example, assumes that increasing the weights of features present in correct labelings but not in incorrect labelings will lead to better predictions. While this is formally true with respect to an underlying linear model, inexact inference methods can disturb and even invert such assumptions.

4.1 Loopy inference

Loopy belief propagation (LBP) is a common approximate inference procedure in which max-product message passing, known to be exact for trees, is applied to arbitrary, cyclic graphical models [5]. While LBP is, of course, inexact, its behavior can be even more problematic for learning. 
Because LBP does not respond to model parameters in the usual way, its predictions can lead a learner away from appropriate parameters even for algorithmically separable problems.

Consider the simple MRF shown in Figure 2 and discussed previously in [6]. All nodes are binary and take labels from the set {−1, 1}. Suppose that node potentials are assigned by type, where each node is of type A or B as indicated and α and β are real-valued parameters:

ψA(−1) = 1    ψA(1) = e^α
ψB(−1) = 1    ψB(1) = e^β

Figure 2: An MRF on which LBP is inexact.

Also let edge potentials ψij(yi, yj) be equal to the constant λ when yi = yj and 1 otherwise. Define λ to be sufficiently positive that the MAP configuration is either (−1, −1, −1, −1) or (1, 1, 1, 1), abbreviated −1 and 1, respectively. In particular, the solution is −1 when α + β < 0 and 1 otherwise. With slight abuse of notation we can write y_MAP = sign(α + β).

We now investigate the behavior of LBP on this example. In general, max-product LBP on pairwise MRFs requires iterating the following rule to update messages mij(yj) from node i to node j, where yj ranges over the possible labels of node j and N(i) is the neighbor set of node i:

mij(yj) = max_{yi} [ ψij(yi, yj) ψi(yi) ∏_{k∈N(i)\{j}} mki(yi) ] .  (5)

Since we take λ to be suitably positive in our example, we can eliminate the max by letting yi = yj, and then divide to remove the edge potentials ψij(yj, yj) = λ. When messages are initialized uniformly to 1 and passed in parallel, symmetry also implies that messages are completely determined by the types of the relevant nodes. 
The updates are then as follows:

mAB(−1) = mBA(−1)             mAB(1) = e^α mBA(1)
mBA(−1) = mAB(−1) mBB(−1)     mBA(1) = e^β mAB(1) mBB(1)
mBB(−1) = mAB(−1)²            mBB(1) = e^β mAB(1)²

Note that the messages mij(−1) remain fixed at 1 after any number of updates. The messages mAB(1), mBA(1), and mBB(1) always take the form exp(pα + qβ) for appropriate values of p and q, and it is easy to show by iterating the updates that, for all three messages, p and q go to ∞ while the ratio q/p converges to γ ≈ 1.089339. The label-1 messages therefore approach 0 when α + γβ < 0 and ∞ when α + γβ > 0. Note that after message normalization (mij(−1) + mij(1) = 1 for all ij) the algorithm converges in either case.

(a) y = −1    (b) y = 1

Figure 3: A two-instance training set. Within each instance, nodes of the same shading share a feature vector, as annotated. Below each instance is its correct labeling.

Beliefs are computed from the converged messages as bi(yi) ∝ ∏_{j∈N(i)} mji(yi), so we can express the prediction of LBP as y_LBP = sign(α + γβ). Intuitively, then, LBP gives a slight preference to the B-type nodes because of their shared edge. If α and β are both positive or both negative, or if α and β differ in sign but |β| > |α| or |α| > γ|β|, LBP finds the correct MAP solution. However, when the strength of the A nodes only slightly exceeds that of the B nodes (γ|β| > |α| > |β|), the preference exerted by LBP is significant enough to flip the labels. For example, if α = 1 and β = −0.95, the true MAP configuration is 1 but LBP converges to −1.

4.2 Learning with LBP

Suppose now that we wish to use the perceptron algorithm with LBP inference to learn the two-instance data set shown in Figure 3. 
For each instance the unshaded nodes are annotated with a feature vector xα = (xα1, xα2) and the shaded nodes are annotated with a feature vector xβ = (xβ1, xβ2). We wish to learn weights w = (w1, w2), modeling node potentials as before with α = w · xα and β = w · xβ. Assume that edge potentials remain fixed using a suitably positive λ.

By the previous analysis, the data are algorithmically separated by w* = (1, −1). On instance (a), α = 1, β = −0.95, and LBP correctly predicts −1. Instance (b) is symmetric. Note that although the predicted configurations are not the true MAP labelings, they correctly match the training labels. The weight vector (1, −1) is therefore an ideal choice in the context of learning. The problem is also separated in the usual sense by the weight vector (−1, 1).

Since we can think of the MAP decision problem as computing sign(α + β) = sign(w · (xα + xβ)), we can apply the perceptron algorithm with update w ← w − ŷ(xα + xβ), where ŷ is the sign of the proposed labeling. The standard perceptron mistake bound guarantees that, with exact inference, separable problems require only a finite number of iterations to find a separating weight vector. Here, however, LBP causes the perceptron to diverge even though the problem is not only separable but also algorithmically separable.

Figure 4 shows the path of the weight vector as it progresses from the origin over the first 20 iterations of the algorithm.

Figure 4: Perceptron learning path.

During each pass through the data the weight vector is updated twice: once after mislabeling instance (a) (w ← w − (1, 0.95)), and again after mislabeling instance (b) (w ← w + (0.95, 1)). The net effect is w ← w + (−0.05, 0.05). 
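This divergence is easy to reproduce. The sketch below is a hypothetical reconstruction: it first recovers γ by iterating the exponents (p, q) of the three label-1 messages, then runs the perceptron with the LBP decision rule sign(w · xα + γ w · xβ). The feature vectors are assumptions chosen so that the updates match those in the text:

```python
# Step 1: recover gamma by iterating the exponents (p, q) of the label-1
# messages, m(1) = exp(p*alpha + q*beta), under the parallel updates
#   m_AB(1) = e^a m_BA(1),  m_BA(1) = e^b m_AB(1) m_BB(1),  m_BB(1) = e^b m_AB(1)^2.
pq = {"AB": (0, 0), "BA": (0, 0), "BB": (0, 0)}
for _ in range(200):
    pq = {
        "AB": (1 + pq["BA"][0], pq["BA"][1]),
        "BA": (pq["AB"][0] + pq["BB"][0], 1 + pq["AB"][1] + pq["BB"][1]),
        "BB": (2 * pq["AB"][0], 1 + 2 * pq["AB"][1]),
    }
gamma = pq["AB"][1] / pq["AB"][0]    # ratio q/p converges to ~1.089339

# Step 2: perceptron with LBP inference on the two-instance training set.
# Hypothetical feature vectors consistent with the updates in the text:
#   instance (a): x_alpha = (1, 0), x_beta = (0, 0.95), label -1
#   instance (b): x_alpha = (0, 1), x_beta = (0.95, 0), label +1
data = [(((1.0, 0.0), (0.0, 0.95)), -1),
        (((0.0, 1.0), (0.95, 0.0)), +1)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

w = [0.0, 0.0]
mistakes = 0
for _ in range(20):                  # 20 passes through the training set
    for (xa, xb), y in data:
        y_hat = 1 if dot(w, xa) + gamma * dot(w, xb) >= 0 else -1
        if y_hat != y:               # perceptron update on a mistake
            mistakes += 1
            w = [w[0] - y_hat * (xa[0] + xb[0]),
                 w[1] - y_hat * (xa[1] + xb[1])]
# Every pass makes both mistakes; w drifts along (-0.05, 0.05) toward (-1, 1),
# directly away from the algorithmically separating w* = (1, -1).
```

Under these assumptions the loop never stops making mistakes, and after 20 passes w ≈ (−1, 1), consistent with the divergence described here.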
The weight vector continually moves in the direction opposite to w* = (1, −1), and learning diverges.

4.3 Discussion

To understand why perceptron learning fails with LBP, it is instructive to visualize the feasible regions of weight space. Exact inference correctly labels instance (a) whenever w1 + 0.95w2 < 0, and, similarly, instance (b) requires a weight vector with 0.95w1 + w2 > 0. Weights that satisfy both constraints are feasible, as depicted in Figure 5(a). For LBP, the preference given to nodes 2 and 3 is effectively a scaling of xβ by γ ≈ 1.089339, so a feasible weight vector must satisfy w1 + 0.95γw2 < 0 and 0.95γw1 + w2 > 0. Since 0.95γ > 1, these constraints define a completely different feasible region of weight space, shown in Figure 5(b). It is clear from the figures why the perceptron does not succeed: it assumes that pushing weights into the feasible region of Figure 5(a) will produce correct labelings, while under LBP the exact opposite is required.

(a) Exact inference    (b) LBP

Figure 5: The feasible regions of weight space for exact inference and LBP. Each numbered gray halfspace indicates the region in which the corresponding instance is correctly labeled; their intersection is the feasible region, colored black.

Algorithmic separability, then, is necessary for learning but may not be sufficient. This does not imply that no algorithm can learn using LBP; a grid search on weight space, for example, will be slow but successful. Instead, care must be taken to ensure that learning and inference are appropriately matched. In particular, it is generally invalid to assume that an arbitrary choice of approximate inference will lead to useful results when the learning method expects exact feedback.

5 Learning bounds for approximate inference

In contrast to the failure of LBP in Section 4, appropriate pairs of inference and learning algorithms do exist. 
We give two bounds using LP-relaxed inference for MRFs with log-linear potentials. First, under the assumption of algorithmic separability, we show that the structured perceptron of Collins [9] makes only a finite number of mistakes. Second, we show using the PAC-Bayesian framework [11] that choosing model parameters to minimize a margin-based empirical risk function (assuming "soft" algorithmic separability) gives rise to a bound on the true risk. In both cases, the proofs are directly adapted from known results using the following characterization of LP relaxation.

Claim 1. Let z = (z1, . . . , zk) be the vector of 0/1 optimization variables for an integer program P. Let Z ⊆ {0, 1}^{dim(z)} be the feasible set of P. Then replacing the integrality constraints in P with box constraints 0 ≤ zi ≤ 1 yields an LP whose feasible polytope has vertex set Z0 ⊇ Z.

Proof. Each z ∈ Z is integral and thus a vertex of the polytope defined by the box constraints alone. The remaining constraints appear in P and by definition do not exclude any element of Z. The addition of constraints cannot eliminate a vertex without rendering it infeasible. Thus, Z ⊆ Z0. □

We can encode the MAP inference problem for MRFs as an integer program over indicators z with objective w · Φ(x, z) for some Φ linear in z (see, for example, [6]). By Claim 1 and the fact that an optimal vertex always exists, LP-relaxed inference given an input x computes

LP_w(x) = arg max_{z∈Z0(x)} w · Φ(x, z) .  (6)

We can think of this as exact inference over an expanded set of labelings Z0(x), some of which may not be valid (i.e., z ∈ Z0(x) may be fractional). To simplify notation, we will assume that labelings y are always translated into the corresponding indicator values z.

5.1 Perceptron

Theorem 1 (adapted from Theorem 1 in [9]). 
Given a sequence of input/labeling pairs {(xi, zi)}, suppose that there exists a weight vector w* with unit norm and a γ > 0 such that, for all i, w* · (Φ(xi, zi) − Φ(xi, z)) ≥ γ for all z ∈ Z0(xi) \ {zi}. (The instances are algorithmically separable with margin γ.) Suppose that there also exists R such that ‖Φ(xi, zi) − Φ(xi, z)‖ ≤ R for all z ∈ Z0(xi). Then the structured perceptron makes at most R²/γ² mistakes.

Proof sketch. Let w^k be the weight vector before the kth mistake; w^1 = 0. Following the proof of Collins without modification, we can show that ‖w^{k+1}‖ ≥ kγ. We now bound ‖w^{k+1}‖ in the other direction. If (x^k, z^k) is the instance on which the kth update occurs and z_LP(k) = LP_{w^k}(x^k), then by the update rule,

‖w^{k+1}‖² = ‖w^k‖² + 2 w^k · (Φ(x^k, z^k) − Φ(x^k, z_LP(k))) + ‖Φ(x^k, z^k) − Φ(x^k, z_LP(k))‖²
           ≤ ‖w^k‖² + R² .  (7)

The inequality follows from the fact that LP-relaxed inference maximizes w · Φ(x^k, z) over all z ∈ Z0(x^k), so the middle term is nonpositive. Hence, by induction, ‖w^{k+1}‖² ≤ kR². Combining the two bounds, k²γ² ≤ ‖w^{k+1}‖² ≤ kR², hence k ≤ R²/γ². □

5.2 PAC-Bayes

The perceptron bound applies when data are perfectly algorithmically separable, but we might also hope to use LP-relaxed inference in the presence of noisy or otherwise almost-separable data. The following theorem adapts an empirical risk minimization bound using the PAC-Bayes framework to show that LP-relaxed inference can also be used to learn successfully in these cases. The measure of empirical risk for a weight vector w over a sample S = (x1, . . . 
, xm) is defined as follows:

R̂(w, S) = (1/m) Σ_{i=1}^{m} max_{z∈H_w(xi)} L(z|xi) ,  (8)

H_w(x) = {z′ ∈ Z0(x) | w · (Φ(x, LP_w(x)) − Φ(x, z′)) ≤ |LP_w(x) − z′|} .

Intuitively, R̂ accounts for the maximum loss of any z that is closer in score than in 1-norm to the LP prediction. Such z are considered "confusable" at test time. The PAC-Bayesian setting requires that, after training, weight vectors be drawn from some distribution Q(w); however, a deterministic version of the bound can also be proved.

Theorem 2 (adapted from Theorem 3 in [11]). Suppose that the loss function L(y|x) is bounded between 0 and 1 and can be expanded to L(z|x) for all z ∈ Z0(x); that is, loss can be defined for every potential value of LP(x). Let ℓ = dim(z) be the number of indicator variables in the LP, and let R bound the 2-norm of the feature vector for a single clique. Let Q(w) be a symmetric Gaussian centered at w as defined in [11]. Then with probability at least 1 − δ over the choice of a sample S of size m from distribution D over inputs x, the following holds for all w:

E_{x∼D, w′∼Q(w)} [L(LP_{w′}(x)|x)] ≤ R̂(w, S) + sqrt( (R²‖w‖² ln(2ℓm / (R²‖w‖²)) + ln(m/δ)) / (2(m − 1)) ) + R²‖w‖²/m .  (9)

The proof in [11] can be directly adapted; the only significant changes are the use of Z0 in place of the set Y of possible labelings and reasoning as above using the definition of LP-relaxed inference.

6 Related work

A number of authors have applied inference approximations to a wide range of learning problems, sometimes with theoretical analysis of approximation quality and often with good empirical results [8, 12, 3]. However, none to our knowledge has investigated the theoretical relationship between approximation and learning performance. Daumé et al. 
[13] developed a method for using a linear model to make decisions during a search-based approximate inference process. They showed that perceptron updates give rise to a mistake bound under the assumption that parameters leading to correct decisions exist. Such results are analogous to those presented in Section 5 in that performance bounds follow from an (implicit) assumption of algorithmic separability.

Wainwright [14] proved that when approximate inference is required at test time due to computational constraints, using an inconsistent (approximate) estimator for learning can be beneficial. His result suggests that optimal performance is obtained when the methods used for training and testing are appropriately aligned, even if those methods are not independently optimal. In contrast, we consider learning algorithms that use identical inference for both training and testing, minimize a general measure of empirical risk rather than maximizing data likelihood, and argue for compatibility between the learning method and the inference process.

Roth et al. [15] consider learning independent classifiers for single labels, essentially using a trivial form of approximate inference. They show that this method can outperform learning with exact inference when algorithmic separability holds, precisely because approximation reduces expressivity; i.e., less complex models require fewer samples to train accurately. When the data are not algorithmically separable, exact inference provides better performance if a large enough sample is available. It is interesting to note that both of our counterexamples involve strong edge potentials. These are precisely the kinds of examples that are difficult to learn using independent classifiers.

7 Conclusion

Effective use of approximate inference for learning depends on two considerations that are irrelevant for prediction. 
First, the expressivity of approximate inference, and consequently the bias for learning, can vary significantly from that of exact inference. Second, learning algorithms can misinterpret feedback received from approximate inference methods, leading to poor results or even divergence. However, when algorithmic separability holds, the use of LP-relaxed inference with standard learning frameworks yields provably good results.

Future work includes the investigation of alternate inference methods that, while potentially less suitable for prediction alone, give better feedback for learning. Conversely, learning methods that are tailored specifically to particular inference algorithms might show improved performance over those that assume exact inference. Finally, the notion of algorithmic separability and the ways in which it might relate (through approximation) to traditional separability deserve further study.

References

[1] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artif. Intell., 42(2-3):393–405, 1990.

[2] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, 2001.

[3] Ben Taskar, Vassil Chatalbashev, and Daphne Koller. Learning associative Markov networks. In ICML '04: Proceedings of the Twenty-First International Conference on Machine Learning, page 102, 2004.

[4] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[5] Kevin Murphy, Yair Weiss, and Michael Jordan. Loopy belief propagation for approximate inference: An empirical study. 
In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 467–475, 1999.

[6] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.

[7] D. Roth and W. Yih. A linear programming formulation for global inference in natural language tasks. In Proc. of the Conference on Computational Natural Language Learning (CoNLL), pages 1–8, 2004.

[8] Charles Sutton and Andrew McCallum. Collective segmentation and labeling of distant entities in information extraction. Technical Report TR # 04-49, University of Massachusetts, 2004.

[9] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP '02: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pages 1–8, 2002.

[10] David A. McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

[11] David McAllester. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, to appear.

[12] Charles Sutton and Andrew McCallum. Piecewise training of undirected models. In 21st Conference on Uncertainty in Artificial Intelligence, 2005.

[13] Hal Daumé III and Daniel Marcu. Learning as search optimization: Approximate large margin methods for structured prediction. In International Conference on Machine Learning (ICML), 2005.

[14] Martin J. Wainwright. Estimating the "wrong" graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research, 7:1829–1859, 2006.

[15] V. Punyakanok, D. Roth, W. Yih, and D. Zimak. Learning and inference over constrained output. 
In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1124–1129, 2005.", "award": [], "sourceid": 809, "authors": [{"given_name": "Alex", "family_name": "Kulesza", "institution": null}, {"given_name": "Fernando", "family_name": "Pereira", "institution": null}]}