{"title": "Partially Observed Maximum Entropy Discrimination Markov Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1977, "page_last": 1984, "abstract": "Learning graphical models with hidden variables can offer semantic insights to complex data and lead to salient structured predictors without relying on expensive, sometime unattainable fully annotated training data. While likelihood-based methods have been extensively explored, to our knowledge, learning structured prediction models with latent variables based on the max-margin principle remains largely an open problem. In this paper, we present a partially observed Maximum Entropy Discrimination Markov Network (PoMEN) model that attempts to combine the advantages of Bayesian and margin based paradigms for learning Markov networks from partially labeled data. PoMEN leads to an averaging prediction rule that resembles a Bayes predictor that is more robust to overfitting, but is also built on the desirable discriminative laws resemble those of the M$^3$N. We develop an EM-style algorithm utilizing existing convex optimization algorithms for M$^3$N as a subroutine. We demonstrate competent performance of PoMEN over existing methods on a real-world web data extraction task.", "full_text": "Partially Observed Maximum Entropy\n\nDiscrimination Markov Networks\n\nJun Zhu\u2020\n\nEric P. Xing\u2021\n\nBo Zhang\u2020\n\n\u2020State Key Lab of Intelligent Tech & Sys, Tsinghua National TNList Lab, Dept. Comp Sci & Tech,\n\u2020Tsinghua University, Beijing China. jun-zhu@mails.thu.edu.cn; dcszb@thu.edu.cn\n\u2021School of Comp. Sci., Carnegie Mellon University, Pittsburgh, PA 15213, epxing@cs.cmu.edu\n\nAbstract\n\nLearning graphical models with hidden variables can offer semantic insights to\ncomplex data and lead to salient structured predictors without relying on expen-\nsive, sometime unattainable fully annotated training data. 
While likelihood-based methods have been extensively explored, to our knowledge, learning structured prediction models with latent variables based on the max-margin principle remains largely an open problem. In this paper, we present a partially observed Maximum Entropy Discrimination Markov Network (PoMEN) model that attempts to combine the advantages of Bayesian and margin-based paradigms for learning Markov networks from partially labeled data. PoMEN leads to an averaging prediction rule that resembles a Bayes predictor and is more robust to overfitting, but is also built on desirable discriminative laws resembling those of the M3N. We develop an EM-style algorithm utilizing existing convex optimization algorithms for M3N as a subroutine. We demonstrate competent performance of PoMEN over existing methods on a real-world web data extraction task.\n\n1 Introduction\n\nInferring structured predictions based on high-dimensional, often multi-modal and hybrid covariates remains a central problem in data mining (e.g., web-info extraction), machine intelligence (e.g., machine translation), and scientific discovery (e.g., genome annotation). Several recent approaches to this problem are based on learning discriminative graphical models defined on composite features that explicitly exploit the structured dependencies among input elements and structured interpretational outputs. Different learning paradigms have been explored, including maximum conditional likelihood [7] and max-margin learning [2, 12, 13], with remarkable success.\nHowever, the problem of structured input/output learning can be intriguing and significantly more difficult when there exist hidden substructures in the data, which is not uncommon in realistic problems. 
As is well known in the probabilistic graphical model literature, hidden variables can facilitate natural incorporation of structured domain knowledge, such as latent semantic concepts or unobserved dependence hierarchies, into the model, which can often result in a more intuitive representation and a more compact parameterization; but learning a partially observed model is often non-trivial because it involves optimizing a more complex cost function, which is usually non-convex and requires additional effort to impute or marginalize out hidden variables. Most existing work along this line, such as the hidden CRF for object recognition [9] and scene segmentation [14] and the dynamic hierarchical MRF for web data extraction [18], falls into likelihood-based learning. For max-margin learning, which is arguably a more desirable discriminative learning paradigm in many application scenarios, learning a Markov network with hidden variables can be extremely difficult, and little work has been done except [11], where, in order to obtain a convex program, the uncertainty in mixture modeling is simplified by a reduction using the MAP component.\n\nA major reason for the difficulty of considering latent structures in max-margin models is the lack of a natural probabilistic interpretation of such models, which, on the other hand, offers the key insight in likelihood-based learning for designing algorithms such as EM for learning partially observed models. Recent work on semi-supervised or unsupervised max-margin learning [1, 4, 16] all fell short of an explicit probabilistic interpretation of how latent variables are handled. The recently proposed Maximum Entropy Discrimination Markov Networks (MaxEnDNet) [20, 19] represent a key advance in this direction. MaxEnDNet offers a general framework to combine Bayesian-style learning and max-margin learning in structured prediction. 
Given a prior distribution of a structured-prediction model, and leveraging a new prediction rule based on a weighted average over an ensemble of prediction models, MaxEnDNet adopts a structured minimum relative entropy principle to learn a posterior distribution of the prediction model in a subspace defined by a set of expected margin constraints. This elegant combination of probabilistic and maximum-margin concepts provides a natural path to incorporating hidden structured variables in learning max-margin Markov networks (M3N), which is the focus of this paper.\nIt has been shown in [20] that, in the fully observed case, MaxEnDNet subsumes the standard M3N [12]. But MaxEnDNet in its full generality offers a number of important advantages while retaining all the merits of the M3N. For example, structured prediction under MaxEnDNet is based on an averaging model and therefore enjoys a desirable smoothing effect, with a uniform convergence bound on generalization error, as shown in [20]; and MaxEnDNet admits a prior that can be designed to introduce useful regularization effects, such as a sparsity bias, as explored in the Laplace M3N [19, 20]. In this paper, we explore yet another advantage of MaxEnDNet stemming from the Bayesian-style max-margin learning formalism: incorporating hidden variables. We present the partially observed MaxEnDNet (PoMEN), which offers a principled way to incorporate latent structures carrying domain knowledge and to learn a discriminative model with partially labeled data. The reducibility of MaxEnDNet to M3N renders many existing convex optimization algorithms developed for learning M3N directly applicable as subroutines for learning our proposed model. We describe an EM-style algorithm for PoMEN based on existing algorithms for M3N. 
As a practical application, we apply the proposed model to a web data extraction task, product information extraction, where collecting fully labeled training data is very difficult. The results show the promise of max-margin learning as opposed to likelihood-based estimation in the presence of hidden variables.\nThe paper is organized as follows. Section 2 reviews the basic max-margin structured prediction formalism and MaxEnDNet. Section 3 presents the partially observed MaxEnDNet. Section 4 applies the model to real web data extraction, and Section 5 brings this paper to a conclusion.\n\n2 Preliminaries\n\nOur goal is to learn a predictive function h : X \to Y from a structured input x \in X to a structured output y \in Y, where Y = Y_1 \times \cdots \times Y_l represents a combinatorial space of structured interpretations of multi-facet objects. For example, in part-of-speech (POS) tagging, Y_i consists of all the POS tags, each label y = (y_1, \ldots, y_l) is a sequence of POS tags, and each input x is a sentence (word sequence). We assume that the feasible set of labels Y(x) \subseteq Y is finite for any x.\nLet F(x, y; w) be a parametric discriminant function. A common choice of F is a linear model, where F is defined by a set of K feature functions f_k : X \times Y \to R and their weights w_k: F(x, y; w) = w^T f(x, y). 
A commonly used predictive function is:\n\nh_0(x; w) = \arg\max_{y \in Y(x)} F(x, y; w).  (1)\n\nBy using different loss functions, the parameters w can be estimated by maximizing the conditional likelihood [7] or by maximizing the margin [2, 12, 13] on labeled training data.\n\n2.1 Maximum margin Markov networks\n\nUnder the M3N formalism, which we will generalize in this paper, given a set of fully labeled training data D = {(x^i, y^i)}_{i=1}^N, max-margin learning [12] solves the following optimization problem and achieves an optimum point estimate of the weight vector w:\n\nP0 (M3N):  \min_{w \in F_0, \xi \in R^N_+}  \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i,  (2)\n\nwhere \xi_i represents a slack variable absorbing errors in training data, C is a positive constant, R_+ denotes the non-negative real numbers, and F_0 is the feasible space for w: F_0 = {w : w^T \Delta f_i(y) \geq \Delta\ell_i(y) - \xi_i; \forall i, \forall y \neq y^i}, in which \Delta f_i(y) = f(x^i, y^i) - f(x^i, y), w^T \Delta f_i(y) is the \u201cmargin\u201d between the true label y^i and a prediction y, and \Delta\ell_i(y) is a loss function with respect to y^i.\nVarious loss functions have been proposed for P0. In this paper, we adopt the hamming loss [12]: \Delta\ell_i(y) = \sum_{j=1}^{|x^i|} I(y_j \neq y^i_j), where I(\cdot) is an indicator function that equals 1 if the argument is true and 0 otherwise. The optimization problem P0 is intractable because of the exponential number of constraints in F_0. Exploiting sparse dependencies among individual labels y_j in y, as reflected in the specific design of the feature functions (e.g., based on pair-wise labeling potentials), efficient optimization algorithms based on cutting-plane [13] or message-passing [12] methods, and various gradient-based methods [3, 10], have been proposed to obtain approximate solutions to P0. 
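The margin and slack quantities in P0 can be made concrete on a toy problem. The sketch below uses a hypothetical 3-position, 2-tag sequence with unary indicator features only (the paper's feature functions would also include pairwise potentials); it computes the hamming loss, the margin w^T \Delta f_i(y), and the resulting slack \xi_i for one example:

```python
import itertools
import numpy as np

# Toy sequence: 3 positions, 2 tags, unary indicator features only
# (hypothetical; a real f would also include pairwise potentials).
L, K = 3, 2

def feat(x, y):
    """f(x, y): one unary feature per (position, tag) pair."""
    f = np.zeros(L * K)
    for j, tag in enumerate(y):
        f[j * K + tag] = x[j]
    return f

def hamming(y, y_true):
    """Hamming loss: number of positions where the labeling disagrees."""
    return sum(a != b for a, b in zip(y, y_true))

x = np.ones(L)
y_true = (0, 1, 0)
rng = np.random.default_rng(0)
w = rng.normal(size=L * K)

# The slack xi_i in P0 is the largest constraint violation:
# xi_i = max(0, max_{y != y_true} [hamming(y) - w^T Delta f_i(y)]).
violations = []
for y in itertools.product(range(K), repeat=L):
    if y == y_true:
        continue
    margin = w @ (feat(x, y_true) - feat(x, y))   # w^T Delta f_i(y)
    violations.append(hamming(y, y_true) - margin)
slack = max(0.0, max(violations))
print(round(slack, 3))
```

Enumerating all labelings, as done here, is exactly what the cutting-plane and message-passing algorithms cited above avoid by exploiting the sparse dependency structure.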
As described shortly, these algorithms can be directly employed as subroutines in solving our proposed model.\n\n2.2 Maximum Entropy Discrimination Markov Networks\n\nInstead of predicting based on a single rule F(\cdot; w) as in M3N, the structured maximum entropy discrimination formalism [19] facilitates a Bayes-style prediction by averaging F(\cdot; w) over a distribution of rules according to a posterior distribution of the weights, p(w):\n\nh_1(x) = \arg\max_{y \in Y(x)} \int p(w) F(x, y; w) dw,  (3)\n\nwhere p(w) is learned by solving an optimization problem referred to as a maximum entropy discrimination Markov network (MaxEnDNet, or MEN) [20] that elegantly combines Bayesian-style learning with max-margin learning. In a MaxEnDNet, a prior over w is introduced to regularize its distribution, and the margins resulting from predictor (3) are used to define a feasible distribution subspace. More formally, given a set of fully observed training data D and a prior distribution p_0(w), MaxEnDNet solves the following problem for an optimal posterior p(w|D), or simply p(w):\n\nP1 (MaxEnDNet):  \min_{p(w) \in F_1, \xi \in R^N_+}  KL(p(w) \| p_0(w)) + U(\xi),  (4)\n\nwhere the objective KL(p(w) \| p_0(w)) + U(\xi) is known as the generalized entropy [8, 5], or regularized KL-divergence, and U(\xi) is a closed proper convex function over the slack variables \xi. U is also known as an additional \u201cpotential\u201d term in the maximum entropy principle. The feasible distribution subspace F_1 is defined as follows:\n\nF_1 = { p(w) : \int p(w) [\Delta F_i(y; w) - \Delta\ell_i(y)] dw \geq -\xi_i, \forall i, \forall y },\n\nwhere \Delta F_i(y; w) = F(x^i, y^i; w) - F(x^i, y; w).\nP1 is a variational optimization problem over p(w) in the feasible subspace F_1. Since both the KL-divergence and the U function in P1 are convex, and the constraints in F_1 are linear, P1 is a convex program. 
Thus, one can apply the calculus of variations to the Lagrangian to obtain a variational extremum, followed by a dual transformation of P1. As proved in [20], the solution to P1 leads to a GLIM for p(w) whose parameters are closely connected to the solution of the M3N.\n\nTheorem 1 (MaxEnDNet (adapted from [20])) The variational optimization problem P1 underlying a MaxEnDNet gives rise to the following optimum distribution of Markov network parameters:\n\np(w) = \frac{1}{Z(\alpha)} p_0(w) \exp{ \sum_{i,y} \alpha_i(y) [\Delta F_i(y; w) - \Delta\ell_i(y)] },  (5)\n\nwhere Z(\alpha) is a normalization factor and the Lagrangian multipliers \alpha_i(y) (corresponding to the constraints in F_1) can be obtained by solving the following dual problem of P1:\n\nD1:  \max_\alpha  -\log Z(\alpha) - U^*(\alpha)\n  s.t. \alpha_i(y) \geq 0, \forall i, \forall y,\n\nwhere U^*(\cdot) is the conjugate of the slack function U(\cdot), i.e., U^*(\alpha) = \sup_\xi ( \sum_{i,y} \alpha_i(y) \xi_i - U(\xi) ).\n\nIt can be shown that when F(x, y; w) = w^T f(x, y), U(\xi) = C \sum_i \xi_i, and p_0(w) is a standard Gaussian N(w|0, I), then p(w) is also a Gaussian, with shifted mean \sum_{i,y} \alpha_i(y) \Delta f_i(y) and covariance matrix I, where the Lagrangian multipliers \alpha_i(y) can be obtained by solving a problem of the form D1 that is isomorphic to the dual of M3N. When applying this p(w) to Eq. (3), one obtains a predictor that is identical to that of the M3N.\nFrom the above reduction, it should be clear that M3N is a special case of MaxEnDNet. 
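The reduction can also be checked numerically: with p(w) = N(\mu_w, I) and a linear F, the averaged score \int p(w) w^T f(x, y) dw equals \mu_w^T f(x, y), so the Bayes-style rule (3) picks the same label as a point predictor at the posterior mean. A small Monte Carlo sketch with hypothetical feature vectors (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_w = np.array([2.0, -1.0, 0.5, 0.0, 1.0])   # posterior mean of p(w) = N(mu_w, I)
candidates = {                                 # hypothetical f(x, y) per candidate y
    "y1": np.array([1.0, 0.0, 0.0, 1.0, 0.0]),
    "y2": np.array([0.0, 1.0, 0.0, 0.0, 1.0]),
    "y3": np.array([0.0, 0.0, 1.0, 0.0, 0.0]),
}

# Averaged score E_{w ~ N(mu_w, I)}[w^T f(x, y)], estimated by sampling.
W = rng.normal(size=(200_000, 5)) + mu_w
avg_score = {y: float((W @ f).mean()) for y, f in candidates.items()}

# Point predictor at the posterior mean: mu_w^T f(x, y).
mean_score = {y: float(mu_w @ f) for y, f in candidates.items()}

# The two rules agree (exactly in expectation; here up to Monte Carlo noise).
print(max(avg_score, key=avg_score.get), max(mean_score, key=mean_score.get))
```

The agreement holds for any Gaussian posterior with identity covariance, since the expectation is linear in w.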
But the MaxEnDNet in its full generality offers a number of important advantages while retaining all the merits of the M3N.\n\nFigure 1: (a) A web page with two data records containing 7 and 8 elements respectively; (b) A partial vision tree of the page in Figure 1(a), where grey nodes are the roots of the two records; (c) A label hierarchy for product information extraction, where the root node represents an entire instance (a web page), leaf nodes are the attributes (i.e., Name, Image, Price, and Description), and inner nodes are intermediate class labels defined for parts of a web page; e.g., {N, I} is a class label for blocks containing both Name and Image.\n\nFirst, the MaxEnDNet prediction is based on model averaging and therefore enjoys a desirable smoothing effect, with a uniform convergence bound on generalization error, as shown in [20]. Second, MaxEnDNet admits a prior that can be designed to introduce useful regularization effects, such as a sparsity bias, as explored in the Laplace M3N [19, 20]. Third, as explored in this paper, MaxEnDNet offers a principled way to incorporate hidden generative models underlying the structured predictions, while allowing the predictive model to be discriminatively trained on partially labeled data. In the sequel, we introduce the partially observed MaxEnDNet (PoMEN), which combines a (possibly latent) generative model and discriminative training for structured prediction.\n\n3 Partially Observed MaxEnDNet\n\nConsider, for example, the problem of web data extraction, which is to identify information of interest on web pages. Each sample is a data record or an entire web page, represented as a set of HTML elements. One striking characteristic of web data extraction is that various types of structural dependencies between HTML elements exist; e.g., the HTML tag tree, or the Document Object Model (DOM) structure, is itself hierarchical. 
In [17], fully observed hierarchical CRFs are shown to have great promise and achieve better performance than flat models like linear-chain CRFs [7]. One method to construct a hierarchical model is to first use a parser to construct a so-called vision tree [17]. For example, Figure 1(b) is a part of the vision tree of the page in Figure 1(a). Then, based on the vision tree, a hierarchical model can be constructed accordingly to extract the attributes of interest, e.g., a product's name, image, price, description, etc. In such a hierarchical extraction model, inner nodes are useful for incorporating long-distance dependencies, and the variables at one level are refinements of the variables at upper levels. To reflect the refinement relationship, the class labels defined as in [17] are also organized in a hierarchy, as in Figure 1(c). Due to concerns over labeling cost and the annotation ambiguity caused by the overlapping of class labels in Figure 1(c), it is desirable to effectively learn a hierarchical extraction model with partially labeled data.\nWithout loss of generality, assume that the structured labeling of a sample consists of two parts: an observed part y and a hidden part z. Both y and z are structured labels; furthermore, the hidden variables are not isolated, but are statistically dependent on each other and on the observed data according to a graphical model p(y, z, w|x) = p(w, z|x) p(y|x, z, w), where p(y|x, z, w) takes the form of a Boltzmann distribution p(y|x, z, w) = \frac{1}{Z} \exp{-F(x, y, z; w)} and x is a global condition as in CRFs [7]. 
Following the spirit of a margin-based structured predictor such as M3N, we employ only the unnormalized energy function F(x, y, z; w) (which usually consists of linear combinations of feature functions or potentials) as the cost function for structured prediction, and we adopt a prediction rule directly extended from the MaxEnDNet: average over all the possible models defined by different w, and at the same time marginalize over all hidden variables z. That is,\n\nh_2(x) = \arg\max_{y \in Y(x)} \sum_z \int p(w, z) F(x, y, z; w) dw.  (6)\n\nNow our problem is learning the optimum p(w, z) from data. Let {z} \equiv (z^1, \ldots, z^N) denote the ensemble of hidden labels of all the samples. Analogous to the setup for learning the MaxEnDNet, we specify a prior distribution p_0({z}) over all the hidden structured labels. The feasible space F_2 of p(w, {z}) can be defined as follows according to the margin constraints:\n\nF_2 = { p(w, {z}) : \sum_z \int p(w, z) [\Delta F_i(y, z; w) - \Delta\ell_i(y)] dw \geq -\xi_i, \forall i, \forall y },\n\nwhere \Delta F_i(y, z; w) = F(x^i, y^i, z; w) - F(x^i, y, z; w), and p(w, z) is the marginal distribution of p(w, {z}) on a single sample, which will be used in (6) to compute the structured prediction.\nAgain we learn the optimum p(w, {z}) based on a structured minimum relative entropy principle as in MaxEnDNet. Specifically, letting p_0(w, {z}) represent a given joint prior over the parameters and the hidden variables, we define the PoMEN problem that gives rise to the optimum p(w, {z}):\n\nP2 (PoMEN):  \min_{p(w, {z}) \in F_2, \xi \in R^N_+}  KL(p(w, {z}) \| p_0(w, {z})) + U(\xi).  (7)\n\nAnalogous to P1, P2 is a variational optimization problem over p(w, {z}) in the feasible space F_2. Again, since both the KL and the U function in P2 are convex, and the constraints in F_2 are linear, P2 is a convex program. 
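Under a factored posterior p(w, z) = p(w) p(z) with Gaussian p(w) (the case treated in Section 3.1), the predictor (6) reduces to scoring each y by \mu_w^T \sum_z p(z) f(x, y, z). A minimal sketch with hypothetical labels, weights, and features:

```python
import numpy as np

Y = ["a", "b"]                           # toy observed-label candidates
Z = [0, 1]                               # toy hidden-label values
mu_w = np.array([0.2, 1.5, 0.4, -0.3])   # posterior mean of p(w) (hypothetical)
p_z = {0: 0.7, 1: 0.3}                   # factored posterior over the hidden label

def f(y, z):
    """Hypothetical joint feature vector standing in for f(x, y, z)."""
    v = np.zeros(4)
    v[Y.index(y) * len(Z) + z] = 1.0
    return v

# h2(x) = argmax_y sum_z \int p(w, z) F(x, y, z; w) dw
#       = argmax_y mu_w^T sum_z p(z) f(x, y, z)   for F = w^T f, p(w,z) = p(w)p(z).
score = {y: float(mu_w @ sum(p_z[z] * f(y, z) for z in Z)) for y in Y}
y_hat = max(score, key=score.get)
print(y_hat, score)
```

Marginalizing z before taking the inner product is what distinguishes h_2 from the fully observed rule (3).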
Thus, we can employ a technique similar to that used to solve MaxEnDNet to solve the PoMEN problem.\n\n3.1 Learning PoMEN\n\nFor a fully general p(w, {z}) where the hidden variables in all samples are coupled, solving P2 based on an extension of Theorem 1 would involve very high-dimensional integration and summation that is in practice intractable. In this paper we consider a simpler case where the hidden labels of different samples are iid and independent of the parameter w in both the prior and the posterior distributions, that is, p_0(w, {z}) = p_0(w) \prod_{i=1}^N p_0(z^i) and p(w, {z}) = p(w) \prod_{i=1}^N p(z^i). This assumption holds in a graphical model where w corresponds to only the observed y variables at the bottom of a hierarchical model. For many practical applications, such as hierarchical web-info extraction, such a model is realistic and adequate. For more general models where dependencies are more global, we can use the above factored model as a generalized mean field approximation to the true distribution, but this extension is beyond the scope of this paper and will be explored in the full paper. Generalizing Theorem 1 and following a coordinate descent principle, we now present an alternating minimization (EM-style) procedure for P2:\n\nStep 1: keeping p(z) fixed, infer p(w) by solving the following problem:\n\n\min_{p(w) \in F'_1, \xi \in R^N_+}  KL(p(w) \| p_0(w)) + C \sum_i \xi_i,  (8)\n\nwhere F'_1 = { p(w) : \int p(w) E_{p(z)}[\Delta F_i(y, z; w) - \Delta\ell_i(y)] dw \geq -\xi_i, \forall i, \forall y }, which is a generalized version of F_1 with hidden variables. Thus, we can apply the same convex optimization techniques as used for solving problem P1. Specifically, assume that the prior distribution p_0(w) is a standard normal and F(x, y, z; w) = w^T f(x, y, z); then the solution (i.e., the posterior distribution) is p(w) = N(w|\mu_w, I), where \mu_w = \sum_{i,y} \alpha_i(y) E_{p(z)}[\Delta f_i(y, z)]. 
The dual variables \alpha are obtained by solving the dual problem:\n\n\max_{\alpha \in P(C)}  \sum_{i,y} \alpha_i(y) \Delta\ell_i(y) - \frac{1}{2} \| \sum_{i,y} \alpha_i(y) E_{p(z)}[\Delta f_i(y, z)] \|^2,  (9)\n\nwhere P(C) = { \alpha : \sum_y \alpha_i(y) = C; \alpha_i(y) \geq 0, \forall i, \forall y }. This dual problem is isomorphic to the dual form of the M3N optimization problem, and we can use existing algorithms developed for M3N, such as [12, 3], to solve it. Alternatively, we can solve the following primal problem by employing existing subgradient [10] or cutting-plane [13] algorithms:\n\n\min_{w \in F'_0, \xi \in R^N_+}  \frac{1}{2} w^T w + C \sum_{i=1}^N \xi_i,  (10)\n\nwhere F'_0 = { w : w^T E_{p(z)}[\Delta f_i(y, z)] \geq \Delta\ell_i(y) - \xi_i; \xi_i \geq 0, \forall i, \forall y }, which is a generalized version of F_0. It is easy to show that the solution to this primal problem is the posterior mean of p(w), which is used to make predictions in the predictive function h_2. Note that the primal problem is very similar to that of M3N, except for the expectations in F'_0. This is not surprising, since it can be shown that M3N is a special case of MaxEnDNet. We discuss how to efficiently compute the expectations E_{p(z)}[\Delta f_i(y, z)] in Step 2.\n\nStep 2: keeping p(w) fixed, based on the factorization assumptions p({z}) = \prod_i p(z^i) and p_0({z}) = \prod_i p_0(z^i), the distribution p(z) for each sample i can be obtained by solving the following problem:\n\n\min_{p(z) \in F^*_1, \xi_i \in R_+}  KL(p(z) \| p_0(z)) + C \xi_i,  (11)\n\nwhere F^*_1 = { p(z) : \sum_z p(z) \int p(w) [w^T \Delta f_i(y, z) - \Delta\ell_i(y)] dw \geq -\xi_i, \forall y }. Since p(w) is a normal distribution, as shown in Step 1, this simplifies to F^*_1 = { p(z) : \sum_z p(z) [\mu_w^T \Delta f_i(y, z) - \Delta\ell_i(y)] \geq -\xi_i, \forall y }. 
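The two alternating steps can be caricatured on a toy problem. In the sketch below (hypothetical features and labels), Step 1 is approximated by subgradient descent on the M3N-style primal (10) built on expected features, and Step 2 is crudely approximated by exponentially reweighting p(z) against a single most-violated labeling rather than solving the full dual; it illustrates only the loop structure, not the paper's exact solver:

```python
import itertools
import numpy as np

C, D, ETA = 1.0, 6, 0.1
Y = list(itertools.product([0, 1], repeat=2))   # toy observed labelings
Z = [0, 1]                                      # toy hidden label values
y_true = (0, 1)

def f(y, z):
    """Hypothetical f(x, y, z) for a single training example."""
    v = np.zeros(D)
    v[z * 3 + y[0] + y[1]] = 1.0
    return v

def loss(y):                                    # hamming loss vs y_true
    return sum(a != b for a, b in zip(y, y_true))

p0 = {z: 1.0 / len(Z) for z in Z}               # uniform prior over z
p_z = dict(p0)
w = np.zeros(D)
for outer in range(10):
    # Step 1: p(z) fixed; subgradient descent on the primal with
    # expected features E_{p(z)}[Delta f(y, z)].
    exp_df = {y: sum(p_z[z] * (f(y_true, z) - f(y, z)) for z in Z) for y in Y}
    for _ in range(50):
        y_star = max(Y, key=lambda y: loss(y) - w @ exp_df[y])  # loss-augmented
        w -= ETA * (w - C * exp_df[y_star])
    # Step 2: p(w) fixed (mean w); reweight p(z) by the margin on the
    # most violated y -- a crude stand-in for solving the full dual.
    y_v = max((y for y in Y if y != y_true),
              key=lambda y: loss(y) - w @ exp_df[y])
    logits = {z: C * (w @ (f(y_true, z) - f(y_v, z))) for z in Z}
    m = max(logits.values())
    weights = {z: p0[z] * np.exp(v - m) for z, v in logits.items()}
    total = sum(weights.values())
    p_z = {z: weights[z] / total for z in Z}
print({z: round(p, 3) for z, p in p_z.items()})
```

In the paper's procedure, Step 2 would instead fit the full set of multipliers \beta(y) (e.g., with a solver such as IPOPT, as reported in Section 4), but the alternation between a w-update and a p(z)-update has the shape shown here.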
Similarly, by introducing a set of Lagrangian multipliers \beta(y), we get:\n\np(z) = \frac{1}{Z(\beta)} p_0(z) \exp{ \sum_y \beta(y) [\mu_w^T \Delta f_i(y, z) - \Delta\ell_i(y)] },  (12)\n\nand the dual variables \beta(y) can be obtained by solving the following dual problem:\n\n\max_{\beta \in P_i(C)}  -\log \sum_z p_0(z) \exp{ \sum_y \beta(y) [\mu_w^T \Delta f_i(y, z) - \Delta\ell_i(y)] },\n\nwhere P_i(C) = { \beta : \sum_y \beta(y) = C, \beta(y) \geq 0, \forall y }. This non-linear constrained optimization problem can be solved with existing solvers, like IPOPT [15]. With a little algebra, we can compute the gradients as follows:\n\n\frac{\partial \log Z(\beta)}{\partial \beta(y)} = \mu_w^T E_{p(z)}[\Delta f_i(y, z)] - \Delta\ell_i(y).\n\nTo efficiently calculate the expectations E_{p(z)}[\Delta f_i(y, z)] required in Step 1 and in the above gradients, we make a gentle assumption that the prior distribution p_0(z) is an exponential distribution of the following form:\n\np_0(z) = \exp{ \sum_m \phi_m(z) }.  (13)\n\nThis assumption is general enough for our purpose, and covers the following commonly used priors:\ni. Log-linear Prior: defined by a set of feature functions and their weights. For example, in a pairwise Markov network, we can define the prior model as p_0(z) \propto \exp{ \sum_{(i,j) \in E} \sum_k \lambda_k g_k(z_i, z_j) }, where g_k(z_i, z_j) are feature functions and \lambda_k are weights.\nii. Independent Prior: defined as p_0(z) = \prod_{j=1}^\ell p_0(z_j). In the logarithm space, we can write it as p_0(z) = \exp{ \sum_{j=1}^\ell \log p_0(z_j) }.\niii. 
Markov Prior: the prior model has the Markov property w.r.t. the model's structure. For example, for a chain graph, the prior distribution can be written as p_0(z) = p_0(z_1) \prod_{j=2}^\ell p_0(z_j | z_{j-1}). Similarly, in the logarithm space, p_0(z) = \exp{ \log p_0(z_1) + \sum_{j=2}^\ell \log p_0(z_j | z_{j-1}) }.\nWith the above assumption, p(z) is an exponential family distribution, and the expectations E_{p(z)}[\Delta f_i(y, z)] can be efficiently calculated by exploiting the sparseness of the model's structure to compute marginal probabilities, e.g., p(z_i) and p(z_i, z_j) in pairwise Markov networks. When the model's tree width is not large, this can be done exactly. For complex models, approximate inference like loopy belief propagation and variational methods can be applied. However, since the number of constraints in (12) is exponential in the size of the observed labels, the optimization problem cannot be efficiently solved directly. A key observation, as explored in [12], is that we can interpret \beta(y) as a probability distribution over y because of the regularity constraints \sum_y \beta(y) = C, \beta(y) \geq 0, \forall y. Thus, we can introduce a set of marginal dual variables and transform the dual problem (12) into an equivalent form with a polynomial number of constraints. The derivative with respect to each marginal dual parameter has the same structure as the above gradients.\n\n4 Experiments\n\nWe apply PoMEN to the problem of web data extraction and compare it with partially observed CRFs (PoHCRF) [9], fully observed hierarchical CRFs (HCRF) [17], and a hierarchical M3N (HM3N) that has the same hierarchical model structure as the HCRF.\n\n4.1 Data Sets, Evaluation Criteria, and Prior for Latent Variables\n\nWe concern ourselves with the problem of identifying product items for sale on the web. For each product item, four attributes, Name, Image, Price, and Description, are extracted in our experiments. The evaluation data consist of product web pages generated from 37 different templates. 
For each template, there are 5 pages for training and 10 for testing. We evaluate all the methods on two different levels of input: record level and page level. For record-level evaluation, we assume that data records are given, and we compare different models on the accuracy of extracting attributes within the given records. For page-level evaluation, the inputs are raw web pages, and all the models perform both record detection and attribute extraction simultaneously, as in [17].\n\nFigure 2: (a) The F1 and block instance accuracy of record-level evaluation for 4 models under different amounts of training data. (b) The F1 and its variance on the attributes Name, Image, Price, and Description.\n\nFigure 3: The average F1 and block instance accuracy of different models with different ratios of training data for two types of page-level evaluation: (a) ST1; and (b) ST2.\n\nIn the 185 training pages, there are 1585 data records in total; in the 370 testing pages, 3391 data records are collected. As for evaluation criteria, we use the standard precision, recall, and their harmonic mean F1 for each attribute, and two comprehensive measures, i.e., average F1 and block instance accuracy, as defined in [17]. We adopt the independent prior described earlier for the latent variables; each factor p_0(z_i) over a single latent label is assumed to be uniform.\n\n4.2 Record-Level Evaluation\n\nIn this evaluation, the partially observed training data are the data records whose leaf nodes are labeled and whose inner nodes are hidden. We randomly select m = 5, 10, 20, 30, 40, or 50 percent of the training records as training data, and test on all the testing records. For each m, 10 independent experiments were conducted, and the average performance is summarized in Figure 2. From Figure 2(a), it can be seen that the HM3N performs slightly better than HCRF trained on fully labeled data. 
For the two partially observed models, PoMEN performs much better than PoHCRF in both average F1 and block instance accuracy, and with lower variances of the scores, especially when the training set is small. As the amount of training data increases, PoMEN performs comparably with the fully observed HM3N. For all the models, higher scores and lower variances are achieved with more training data. Figure 2(b) shows the F1 score on each attribute. Overall, for the attributes Image, Price, and Description, although all models generally perform better with more training data, the improvement is small, and the differences between models are small. This is possibly because the features of these attributes are usually consistent and distinctive, and therefore easier to learn and predict. For the attribute Name, however, a large amount of training data is needed to learn a good model, because its underlying features have diverse appearance on web pages.\n\n4.3 Page-Level Evaluation\n\nExperiments on page-level prediction are conducted similarly to the above, and the results are summarized in Figure 3. Two different partial labeling strategies are used to generate training data. ST1: label the leaf nodes and the nodes that represent data records; ST2: label more information based on ST1, e.g., also label the nodes above the \u201cData Record\u201d nodes in the hierarchy, as in Figure 1(c). Due to space limitations, we only report average F1 and block instance accuracy.\nFor ST1, PoMEN achieves better scores and lower variances than PoHCRF in both average F1 and block instance accuracy. The HM3N performs slightly better than HCRF (both trained on full labeling), and PoMEN performs comparably with the fully observed HCRF in block instance accuracy. For ST2, with more supervision information, PoHCRF achieves higher performance that is comparable to that of HM3N in average F1, but slightly lower than HM3N in block instance accuracy. 
For the latent models, PoHCRF performs slightly better in average F1, and PoMEN does better in block instance accuracy; moreover, the variances of PoMEN are much smaller than those of PoHCRF in both average F1 and block instance accuracy. We can also see that PoMEN does not change much when additional label information is provided in ST2. Thus, the max-margin principle could provide a better paradigm than likelihood-based estimation for learning latent hierarchical models.\nFor the second step of learning PoMEN, the IPOPT solver [15] was used to compute the distribution p(z). Interestingly, the performance of PoMEN does not change much over the iterations, and our results were achieved within 3 iterations. It is possible that in hierarchical models, since inner variables usually represent overlapping concepts, the initial distribution is already reasonably good at describing confidence in the labeling, due to implicit consistency across the labels. 
This is unlike multi-label learning [6], where only one of the multiple labels is true and more probability mass should be redistributed onto the true label during the EM iterations.

5 Conclusions
We have presented an extension of standard max-margin learning to address the challenging problem of learning Markov networks in the presence of structured hidden variables. Our approach generalizes the maximum entropy discrimination Markov networks (MaxEnDNet), which offer a general framework for combining Bayesian-style and max-margin learning and subsume the standard M3N as a special case, to accommodate structured hidden variables. For the partially observed MaxEnDNet, we developed an EM-style algorithm based on existing convex optimization algorithms developed for the standard M3N. We applied the proposed model to a real-world web data extraction task and showed that learning latent hierarchical models based on the max-margin principle can be better than likelihood-based learning with hidden variables.

Acknowledgments
This work was done while J.Z. was a visiting researcher at CMU, supported by a State Scholarship from China and by NSF grants DBI-0546594 and DBI-0640543 awarded to E.X. J.Z. and B.Z. are also supported by Chinese NSF Grants 60621062 and 60605003; National Key Foundation R&D Projects 2003CB317007, 2004CB318108, and 2007CB311003; and the Basic Research Foundation of the Tsinghua National Lab for Info Sci & Tech.

References
[1] Y. Altun, D. McAllester, and M. Belkin. Maximum margin semi-supervised learning for structured variables. In NIPS, 2006.
[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, 2003.
[3] P. Bartlett, M. Collins, B. Taskar, and D. McAllester. Exponentiated gradient algorithms for large-margin structured classification. In NIPS, 2004.
[4] U. Brefeld and T. Scheffer. Semi-supervised learning for structured output variables.
In ICML, 2006.
[5] M. Dudík, S.J. Phillips, and R.E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. JMLR, 8:1217–1260, 2007.
[6] R. Jin and Z. Ghahramani. Learning with multiple labels. In NIPS, 2002.
[7] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[8] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, 2001.
[9] A. Quattoni, M. Collins, and T. Darrell. Conditional random fields for object recognition. In NIPS, 2004.
[10] N.D. Ratliff, J.A. Bagnell, and M.A. Zinkevich. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
[11] F. Sha and L. Saul. Large margin hidden Markov models for automatic speech recognition. In NIPS, 2006.
[12] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[13] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[14] J. Verbeek and B. Triggs. Scene segmentation with conditional random fields learned from partially labeled images. In NIPS, 2007.
[15] A. Wächter and L.T. Biegler. On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.
[16] L. Xu, D. Wilkinson, F. Southey, and D. Schuurmans. Discriminative unsupervised learning of structured predictors. In ICML, 2006.
[17] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. Simultaneous record detection and attribute labeling in web data extraction. In SIGKDD, 2006.
[18] J. Zhu, Z. Nie, B. Zhang, and J.-R. Wen. Dynamic hierarchical Markov random fields and their application to web data extraction. In ICML, 2007.
[19] J. Zhu, E.P. Xing, and B. Zhang. Laplace maximum margin Markov networks. In ICML, 2008.
[20] J. Zhu, E.P. Xing, and B. Zhang. Maximum entropy discrimination Markov networks. Technical Report CMU-ML-08-104, Machine Learning Department, Carnegie Mellon University, 2008.