{"title": "Discriminative Learning for Label Sequences via Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Discriminative Learning for Label \n\nSequences via Boosting \n\nYasemin Altun, Thomas Hofmann and Mark Johnson* \n\nDepartment of Computer Science \n\n*Department of Cognitive and Linguistics Sciences \n\nBrown University, Providence, RI 02912 \n\n{altun,th}@cs.brown.edu, Mark_Johnson@brown.edu \n\nAbstract \n\nThis paper investigates a boosting approach to discriminative \nlearning of label sequences based on a sequence rank loss function. \nThe proposed method combines many of the advantages of boost(cid:173)\ning schemes with the efficiency of dynamic programming methods \nand is attractive both, conceptually and computationally. In addi(cid:173)\ntion, we also discuss alternative approaches based on the Hamming \nloss for label sequences. The sequence boosting algorithm offers an \ninteresting alternative to methods based on HMMs and the more \nrecently proposed Conditional Random Fields. Applications areas \nfor the presented technique range from natural language processing \nand information extraction to computational biology. We include \nexperiments on named entity recognition and part-of-speech tag(cid:173)\nging which demonstrate the validity and competitiveness of our \napproach. \n\n1 \n\nIntroduction \n\nThe problem of annotating or segmenting observation sequences arises in many \napplications across a variety of scientific disciplines, most prominently in natural \nlanguage processing, speech recognition, and computational biology. 
Well-known applications include part-of-speech (POS) tagging, named entity classification, information extraction, text segmentation and phoneme classification in text and speech processing [7], as well as problems like protein homology detection, secondary structure prediction or gene classification in computational biology [3].

Up to now, the predominant formalism for modeling and predicting label sequences has been based on Hidden Markov Models (HMMs) and variations thereof. Yet, despite their success, generative probabilistic models - of which HMMs are a special case - have two major shortcomings, which this paper is not the first to point out. First, generative probabilistic models are typically trained using maximum likelihood estimation (MLE) for a joint sampling model of observation and label sequences. As has been emphasized frequently, MLE based on the joint probability model is inherently non-discriminative and thus may lead to suboptimal prediction accuracy. Secondly, efficient inference and learning in this setting often requires making questionable conditional independence assumptions. More precisely, in the case of HMMs, it is assumed that the Markov blanket of the hidden label variable at time step t consists of the previous and next labels as well as the t-th observation. This implies that all dependencies on past and future observations are mediated through neighboring labels.

In this paper, we investigate the use of discriminative learning methods for learning label sequences. This line of research continues previous approaches for learning conditional models, namely Conditional Random Fields (CRFs) [6], and discriminative re-ranking [1, 2].
CRFs have two main advantages compared to HMMs: they are trained discriminatively by maximizing a conditional (or pseudo-) likelihood criterion, and they are more flexible in modeling additional dependencies such as direct dependencies of the t-th label on past or future observations. However, we strongly believe there are two further lines of research that are worth pursuing and may offer additional benefits or improvements.

First of all, and this is the main emphasis of this paper, an exponential loss function such as the one used in boosting algorithms [9, 4] may be preferable to the logarithmic loss function used in CRFs. In particular, we will present a boosting algorithm that has the additional advantage of performing implicit feature selection, typically resulting in very sparse models. This is important for model regularization as well as for reasons of efficiency in high-dimensional feature spaces. Secondly, we will also discuss the use of loss functions that explicitly minimize the zero/one loss on labels, i.e. the Hamming loss, as an alternative to loss functions based on ranking or predicting entire label sequences.

2 Additive Models and Exponential Families

Formally, learning label sequences is a generalization of the standard supervised classification problem. The goal is to learn a discriminant function for sequences, i.e. a mapping from observation sequences X = (x_1, x_2, ..., x_t, ...) to label sequences Y = (y_1, y_2, ..., y_t, ...). We assume the availability of a training set of labeled sequences X = {(X^i, Y^i) : i = 1, ..., n} from which to learn this mapping. In this paper, we focus on discriminant functions that can be written as additive models.
The models under consideration take the following general form:

F_\theta(X, Y) = \sum_t F_\theta(X, Y; t), with F_\theta(X, Y; t) = \sum_k \theta_k f_k(X, Y; t)    (1)

Here f_k denotes a (discrete) feature in the language of maximum entropy modeling, or a weak learner in the language of boosting. In the context of label sequences, f_k will typically be either of the form f_k^{(1)}(x_{t+s}, y_t) (with s \in {-1, 0, 1}) or f_k^{(2)}(y_{t-1}, y_t). The first type of feature models dependencies between the observation sequence X and the t-th label in the sequence, while the second type models inter-label dependencies between neighboring label variables. For ease of presentation, we will assume that all features are binary, i.e. each weak learner corresponds to an indicator function. A typical way of defining a set of weak learners is as follows:

f_k^{(1)}(x_{t+s}, y_t) = \delta(y_t, \bar{y}(k)) \chi_k(x_{t+s})    (2)
f_k^{(2)}(y_{t-1}, y_t) = \delta(y_t, \bar{y}(k)) \delta(y_{t-1}, \tilde{y}(k))    (3)

where \delta denotes the Kronecker delta and \chi_k is a binary feature function that extracts a feature from an observation pattern; \bar{y}(k) and \tilde{y}(k) refer to the label values for which the weak learner becomes \"active\".

There is a natural way to associate a conditional probability distribution over label sequences Y with an additive model F_\theta by defining an exponential family for every fixed observation sequence X:

P_\theta(Y|X) = exp[F_\theta(X, Y)] / Z_\theta(X), with Z_\theta(X) = \sum_Y exp[F_\theta(X, Y)].    (4)

This distribution is in exponential normal form and the parameters \theta are also called natural or canonical parameters. By performing the sum over the sequence index t, we can see that the corresponding sufficient statistics are given by S_k(X, Y) = \sum_t f_k(X, Y; t). These sufficient statistics simply count the number of times the feature f_k has been \"active\" along the labeled sequence (X, Y).

3 Logarithmic Loss and Conditional Random Fields

In CRFs, the log-loss of the model with parameters \theta w.r.t.
a set of sequences X is defined as the negative sum of the log-probabilities of each training label sequence given its observation sequence,

\mathcal{H}^{log}(\theta; X) = -\sum_i \log P_\theta(Y^i | X^i).    (5)

Although [6] has proposed a modification of improved iterative scaling for parameter estimation in CRFs, gradient-based methods such as conjugate gradient descent have often been found to be more efficient for minimizing the convex loss function in Eq. (5) (cf. [8]). The gradient can be readily computed as

\nabla_\theta \mathcal{H}^{log}(\theta; X) = \sum_i ( E[S(X, Y) | X = X^i] - S(X^i, Y^i) ),    (6)

where expectations are taken w.r.t. P_\theta(Y|X). The stationary equations then simply state that, uniformly averaged over the training data, the observed sufficient statistics should match their conditional expectations. Computationally, the evaluation of S(X^i, Y^i) is straightforward counting, while summing over all sequences Y to compute E[S(X, Y) | X = X^i] can be performed using dynamic programming, since the dependency structure between labels is a simple chain.

4 Ranking Loss Functions for Label Sequences

As an alternative to logarithmic loss functions, we propose to minimize an upper bound on the ranking loss [9] adapted to label sequences. The ranking loss of a discriminant function F_\theta w.r.t. a set of training sequences is defined as

\mathcal{H}^{rnk}(\theta; X) = \sum_i \sum_{Y \neq Y^i} \Theta(F_\theta(X^i, Y) - F_\theta(X^i, Y^i)), where \Theta(x) = 1 for x \geq 0 and \Theta(x) = 0 otherwise,    (7)

which is simply the sum over all training sequences of the number of label sequences that are ranked higher than or equal to the true label sequence. It is straightforward to see (based on a term-by-term comparison) that an upper bound on the rank loss is given by the following exponential loss function:

\mathcal{H}^{exp}(\theta; X) = \sum_i \sum_{Y \neq Y^i} exp[F_\theta(X^i, Y) - F_\theta(X^i, Y^i)] = \sum_i [P_\theta(Y^i|X^i)^{-1} - 1].    (8)

Interestingly, this simply leads to a loss function that uses the inverse conditional probability of the true label sequence, if we define this probability via the exponential form in Eq. (4).
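The quantities in Eq. (8) are computable exactly because, for a chain, the partition function of Eq. (4) decomposes into a forward recursion. The following sketch (not from the paper; the emission/transition score arrays are made-up toy values, and the score decomposition is the simple HMM-style special case of Eq. (1)) computes the per-sequence term of the exponential rank loss:

```python
import math
from itertools import product

def seq_score(emit, trans, labels):
    # F_theta(X, Y): per-position emission scores plus transition scores.
    s = sum(emit[t][y] for t, y in enumerate(labels))
    s += sum(trans[a][b] for a, b in zip(labels, labels[1:]))
    return s

def log_partition(emit, trans):
    # log Z_theta(X) of Eq. (4) via the forward recursion over the label chain.
    T, L = len(emit), len(emit[0])
    alpha = list(emit[0])  # alpha[y]: log-sum over prefixes ending in label y
    for t in range(1, T):
        alpha = [emit[t][y] +
                 math.log(sum(math.exp(alpha[yp] + trans[yp][y]) for yp in range(L)))
                 for y in range(L)]
    return math.log(sum(math.exp(a) for a in alpha))

def exp_rank_loss(emit, trans, gold):
    # Per-sequence term of Eq. (8): 1 / P_theta(Y^i | X^i) - 1.
    log_p = seq_score(emit, trans, gold) - log_partition(emit, trans)
    return math.exp(-log_p) - 1.0
```

For short toy sequences the forward recursion can be checked against brute-force enumeration of all |label|^T sequences, which is how one would sanity-test such an implementation.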
Notice that compared to [1], we include all sequences and not just the top N list generated by some external mechanism. As we will show shortly, this explicit summation is possible because a dynamic programming formulation is available to compute sums over all sequences efficiently.

In order to derive gradient equations for the exponential loss we can simply make use of the elementary facts

\nabla_\theta(-\log P(\theta)) = -\nabla_\theta P(\theta) / P(\theta), and \nabla_\theta P(\theta)^{-1} = -\nabla_\theta P(\theta) / P(\theta)^2 = \nabla_\theta(-\log P(\theta)) / P(\theta).    (9)

Then it is easy to see that

\nabla_\theta \mathcal{H}^{exp}(\theta; X) = \sum_i P_\theta(Y^i|X^i)^{-1} ( E[S(X, Y) | X = X^i] - S(X^i, Y^i) ).    (10)

The only difference between Eq. (6) and Eq. (10) is the non-uniform weighting of different sequences by their inverse probability, hence putting more emphasis on training label sequences that receive a small overall (conditional) probability.

5 Boosting Algorithm for Label Sequences

As an alternative to a simple gradient method, we now turn to the derivation of a boosting algorithm, following the boosting formulation presented in [9]. Let us introduce a relative weight (or distribution) D(i, Y) for each label sequence Y w.r.t. a training instance (X^i, Y^i), i.e. \sum_i \sum_Y D(i, Y) = 1:

D(i, Y) = exp[F_\theta(X^i, Y) - F_\theta(X^i, Y^i)] / \sum_j \sum_{Y' \neq Y^j} exp[F_\theta(X^j, Y') - F_\theta(X^j, Y^j)], for Y \neq Y^i    (11)

= D(i) P_\theta(Y|X^i) / (1 - P_\theta(Y^i|X^i)), where D(i) = [P_\theta(Y^i|X^i)^{-1} - 1] / \sum_j [P_\theta(Y^j|X^j)^{-1} - 1].    (12)

In addition, we define D(i, Y^i) = 0. Eq. (12) shows how we can split D(i, Y) into a relative weight for each training instance, given by D(i), and a relative weight for each sequence, given by the re-normalized conditional probability P_\theta(Y|X^i). Notice that D(i) \to 0 as we approach the perfect prediction case of P_\theta(Y^i|X^i) \to 1.

We define a boosting algorithm which in each round aims at minimizing the partition function or weight normalization constant Z_k w.r.t.
a weak learner f_k and a corresponding optimal parameter increment \Delta\theta_k:

Z_k(\Delta\theta_k) = \sum_i D(i) \sum_{Y \neq Y^i} [P_\theta(Y|X^i) / (1 - P_\theta(Y^i|X^i))] exp[\Delta\theta_k (S_k(X^i, Y) - S_k(X^i, Y^i))]    (13)
= \sum_b ( \sum_i D(i) P_\theta(b|X^i; k) ) exp[b \Delta\theta_k],    (14)

where P_\theta(b|X^i; k) = \sum_{Y \in Y(b; X^i)} P_\theta(Y|X^i) / (1 - P_\theta(Y^i|X^i)) and Y(b; X^i) = {Y : Y \neq Y^i \wedge (S_k(X^i, Y) - S_k(X^i, Y^i)) = b}. This minimization problem is only tractable if the number of features is small, since a dynamic programming run with accumulators [6] for every feature seems to be required in order to compute the probabilities P_\theta(b|X^i; k), i.e. the probability for the k-th feature to be active exactly b times, conditioned on the observation sequence X^i.

In cases where this is intractable (and we assume this will be the case in most applications), one can instead minimize an upper bound on every Z_k. The general idea is to exploit the convexity of the exponential function and to use the bound

e^x \leq [(x^{max} - x) / (x^{max} - x^{min})] e^{x^{min}} + [(x - x^{min}) / (x^{max} - x^{min})] e^{x^{max}},    (15)

which is valid for every x \in [x^{min}; x^{max}].

We introduce the following shorthand notation: u_{ik}(Y) = S_k(X^i, Y) - S_k(X^i, Y^i), u_{ik}^{max} = \max_{Y \neq Y^i} u_{ik}(Y), u_{ik}^{min} = \min_{Y \neq Y^i} u_{ik}(Y), u_k^{max} = \max_i u_{ik}^{max}, u_k^{min} = \min_i u_{ik}^{min}, and \pi_i(Y) = P_\theta(Y|X^i) / (1 - P_\theta(Y^i|X^i)), which allows us to rewrite

Z_k(\Delta\theta_k) = \sum_i D(i) \sum_{Y \neq Y^i} \pi_i(Y) exp[\Delta\theta_k u_{ik}(Y)]    (16)
\leq \sum_i D(i) \sum_{Y \neq Y^i} \pi_i(Y) [ (u_{ik}^{max} - u_{ik}(Y)) / (u_{ik}^{max} - u_{ik}^{min}) e^{\Delta\theta_k u_{ik}^{min}} + (u_{ik}(Y) - u_{ik}^{min}) / (u_{ik}^{max} - u_{ik}^{min}) e^{\Delta\theta_k u_{ik}^{max}} ]    (17)
= \sum_i D(i) ( r_{ik} e^{\Delta\theta_k u_{ik}^{min}} + (1 - r_{ik}) e^{\Delta\theta_k u_{ik}^{max}} ), where r_{ik} = \sum_{Y \neq Y^i} \pi_i(Y) (u_{ik}^{max} - u_{ik}(Y)) / (u_{ik}^{max} - u_{ik}^{min}).    (18)

By taking the second derivative w.r.t. \Delta\theta_k it is easy to verify that this is a convex function of \Delta\theta_k, which can be minimized with a simple line search.
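Since the Eq. (18) bound is convex in \Delta\theta_k and its derivative is monotone, the line search can be a plain bisection on the derivative. A minimal sketch, not the paper's implementation: D, r, u_min and u_max are given as plain lists (made-up stand-ins, not quantities from the experiments), and we assume u_min[i] < 0 < u_max[i] so that the derivative changes sign:

```python
import math

def bound(x, D, r, u_min, u_max):
    # Eq. (18) upper bound on Z_k at step size x = Delta theta_k.
    return sum(d * (ri * math.exp(x * a) + (1 - ri) * math.exp(x * b))
               for d, ri, a, b in zip(D, r, u_min, u_max))

def line_search(D, r, u_min, u_max, tol=1e-10):
    # Minimize the convex bound by bisection on its first derivative.
    def grad(x):
        return sum(d * (ri * a * math.exp(x * a) + (1 - ri) * b * math.exp(x * b))
                   for d, ri, a, b in zip(D, r, u_min, u_max))
    lo, hi = -1.0, 1.0
    while grad(lo) > 0:   # expand the bracket until grad(lo) < 0 < grad(hi)
        lo *= 2
    while grad(hi) < 0:
        hi *= 2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a single training sequence the bound coincides with the looser single-interval bound, so the result can be checked against the analytic step size of Eq. (21).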
\n\nIf one is willing to accept a looser bound, one can instead work with the inter(cid:173)\nval [uk'in; uk'ax] which is the union of the intervals [u'[kin; u'[kax] for every training \nsequence i and obtain the upper bound \n\nZk(L.Bk) < rkeMkuk';n + (1 _ rk)eL:o.Okuk'ax \n\n\"D(i) \" \n~ ~ \n\n7fi(Y) uk'ax - Uik(:) \n\nu max _umm \n\ni \n\ny=/-yi \n\nk \n\nk \n\nWhich can be solved analytically \n\nL.B -\n\n1 \n\nk - uk'ax _ uk'in g \n\n10 \n\n( \n\n-rkuk'in \n\n(1 - rk)Uk'ax \n\n) \n\n(19) \n\n(20) \n\n(21) \n\nbut will in general lead to more conservative step sizes. \n\nThe final boosting procedure picks at every round the feature for which the upper \nbound on Zk is minimal and then performs an update of Bk +- Bk + L.Bk. Of course, \none might also use more elaborate techniques to find the optimal L.Bk, once !k \nhas been selected, since the upper bound approximation may underestimate the \noptimal step sizes. It is important to see that the quantities involved (rik and rk, \nrespectively) are simple expectations of sufficient statistics that can be computed for \nall features simultaneously with a single dynamic programming run per sequence. \n\n6 Hamming Loss for Label Sequences \n\nIn many applications one is primarily interested in the label-by-labelloss or Ham(cid:173)\nming loss [9]. Here we investigate how to train models by minimizing an upper \n\n\fbound on the Hamming loss. The following logarithmic loss aims at maximizing \nthe log-probability for each individual label and is given by \n\nF1og(B;X) == - LL)og Po(y1I Xi ) = - LLlog L PO(YIXi ). \n\n(22) \n\nAgain, focusing on gradient descent methods, the gradient is given by \n\nv:Yt = Y; \n\nAs can be seen, the expected sufficient statistics are now compared not to their \nempirical values, but to their expected values, conditioned on a given label value \nY; (and not the entire sequence Vi). 
In order to evaluate these expectations, one can perform dynamic programming using the algorithm described in [5], which has (independently of our work) focused on the use of Hamming loss functions in the context of CRFs. This algorithm has the complexity of the forward-backward algorithm, scaled by a constant.

Similar to the log-loss case, one can define an exponential loss function that corresponds to a margin-like quantity at every single label. We propose minimizing the following loss function:

\mathcal{F}^{exp}(\theta; X) = \sum_i \sum_t \sum_{Y: Y_t \neq y_t^i} exp[ F_\theta(X^i, Y) - \log \sum_{Y': Y'_t = y_t^i} exp F_\theta(X^i, Y') ]    (24)
= \sum_i \sum_t [ P_\theta(y_t^i|X^i)^{-1} - 1 ], where P_\theta(y_t^i|X^i) = \sum_{Y: Y_t = y_t^i} exp[F_\theta(X^i, Y)] / \sum_Y exp[F_\theta(X^i, Y)].    (25)

As a motivation, we point out that for sequences of length 1 this reduces to the standard multi-class exponential loss. Effectively, in this model the prediction of a label y_t mimics the probabilistic marginalization, i.e. y_t^* = \arg\max_y F_\theta(X^i, y; t) with F_\theta(X^i, y; t) = \log \sum_{Y: Y_t = y} exp[F_\theta(X^i, Y)].

Similar to the log-loss case, the gradient is given by

\nabla_\theta \mathcal{F}^{exp}(\theta; X) = \sum_i \sum_t P_\theta(y_t^i|X^i)^{-1} ( E[S(X, Y) | X = X^i] - E[S(X, Y) | X = X^i, Y_t = y_t^i] ).    (26)

Again we see the same differences between the log-loss and the exponential loss, but this time for individual labels: labels for which the marginal probability P_\theta(y_t^i|X^i) is small are accentuated in the exponential loss. The computational complexity of computing \nabla_\theta \mathcal{F}^{exp} and \nabla_\theta \mathcal{F}^{log} is practically the same. We have not been able to derive a boosting formulation for this loss function, mainly because it cannot be written as a sum of exponential terms. We have thus resorted to conjugate gradient descent for minimizing \mathcal{F}^{exp} in our experiments.
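All of the label-by-label quantities in Eqs. (22)-(26) reduce to per-position marginals P(Y_t = y | X), which the forward-backward algorithm delivers in one pass. A minimal sketch for the same toy chain parameterization as before (the emit/trans score arrays are illustrative stand-ins, not the paper's features):

```python
import math
from itertools import product

def marginals(emit, trans):
    # P(Y_t = y | X) for a chain model, via forward-backward in log space.
    T, L = len(emit), len(emit[0])
    def lse(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))
    alpha = [list(emit[0])]  # alpha[t][y]: log-sum over prefixes ending in y at t
    for t in range(1, T):
        alpha.append([emit[t][y] + lse([alpha[t - 1][yp] + trans[yp][y] for yp in range(L)])
                      for y in range(L)])
    beta = [[0.0] * L for _ in range(T)]  # beta[t][y]: log-sum over suffixes after t
    for t in range(T - 2, -1, -1):
        beta[t] = [lse([trans[y][yn] + emit[t + 1][yn] + beta[t + 1][yn] for yn in range(L)])
                   for y in range(L)]
    log_z = lse(alpha[-1])
    return [[math.exp(alpha[t][y] + beta[t][y] - log_z) for y in range(L)] for t in range(T)]
```

Given these marginals, the per-label exponential loss of Eq. (25) is simply sum(1.0 / m[t][gold[t]] - 1.0 for t in range(T)); each row of the output is a distribution over labels and can be verified against brute-force enumeration.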
\n\n7 Experimental Results \n\n7 .1 Named Entity Recognition \n\nNamed Entity Recognition (NER) , a subtask of Information Extraction, is the task \nof finding the phrases that contain person, location and organization names, times \nand quantities. Each word is tagged with the type of the name as well as its position \nin the name phrase (i.e. whether it is the first item of the phrase or not) in order \nto represent the boundary information. \n\n\fWe used a Spanish corpus which was provided for the Special Session of CoNLL2002 \non NER. The data is a collection of news wire articles and is tagged for person \nnames, organizations, locations and miscellaneous names. \n\nWe used simple binary features to ask questions about the word being tagged, as \nwell as the previous tag (i.e. HMM features). An example feature would be: Is the \ncurrent word= 'Clinton' and the tag='Person-Beginning '? We also used features to \nask detailed questions (i.e. spelling features) about the current word (e.g.: Is the \ncurrent word capitalized and the tag='Location-Intermediate'?) and the neighbor(cid:173)\ning words. These questions cannot be asked (in a principled way) in a generative \nHMM model. We ran experiments comparing the different loss functions optimized \nwith the conjugate gradient method and the boosting algorithm. We designed \nthree sets of features: HMM features (=31), 31 and detailed features of the cur(cid:173)\nrent word (= 32), and 32 and detailed features of the neighboring words (=33). \nThe results summarized in Table 1 \ndemonstrate the competitiveness of the \nproposed loss functions with respect to \n1{log. We observe that with different \nsets of features, the ordering of the per(cid:173)\nformance of the loss functions changes. \nBoosting performs worse than the conju(cid:173)\ngate gradient when only HMM features \nare used, since there is not much infor(cid:173)\nmation in the features other than the \nidentity of the word to be labeled. 
Consequently, the boosting algorithm needs to include almost all weak learners in the ensemble and cannot exploit feature sparseness. When there are more detailed features, the boosting algorithm is competitive with the conjugate gradient method, but has the advantage of generating sparser models. The conjugate gradient method uses all of the available features, whereas boosting uses only about 10% of the features.

Table 1: Test error of the Spanish corpus for named entity recognition.

Feature Set | Objective      | log  | exp  | boost
S1          | \mathcal{H}    | 6.60 | 6.95 | 8.05
S1          | \mathcal{F}    | 6.73 | 7.33 | -
S2          | \mathcal{H}    | 6.72 | 7.03 | 6.93
S2          | \mathcal{F}    | 6.67 | 7.49 | -
S3          | \mathcal{H}    | 6.15 | 5.84 | 6.77
S3          | \mathcal{F}    | 5.90 | 5.10 | -

7.2 Part-of-Speech Tagging

We used the Penn TreeBank corpus for the part-of-speech tagging experiments. The features were similar to the feature sets S1 and S2 described above in the context of NER. Table 2 summarizes the experimental results obtained on this task. It can be seen that the test errors obtained by the different loss functions lie within a relatively small range. Qualitatively, the behavior of the different optimization methods is comparable to the NER experiments.

Table 2: Test error of the Penn TreeBank corpus for POS tagging.

Feature Set | Objective      | log  | exp  | boost
S1          | \mathcal{H}    | 4.69 | 5.04 | 10.58
S1          | \mathcal{F}    | 4.96 | 4.88 | -
S2          | \mathcal{H}    | 4.37 | 4.74 | 5.09
S2          | \mathcal{F}    | 4.90 | 4.71 | -

7.3 General Comments

Even with the tighter bound in the boosting formulation, the same features are selected many times, because of the conservative estimate of the step size for parameter updates.
We expect to speed up the convergence of the boosting algorithm by using a more sophisticated line search mechanism to compute the optimal step length, a conjecture that will be addressed in future work.

Although we did not use real-valued features in our experiments, we observed that including real-valued features in a conjugate gradient formulation is a challenge, whereas it is very natural to have such features in a boosting algorithm.

We noticed in our experiments that defining a distribution over the training instances using the inverse conditional probability creates problems in the boosting formulation for data sets that are highly unbalanced in terms of the length of the training sequences. To overcome this problem, we divided the sentences into pieces such that the variation in the length of the sentences is small. The conjugate gradient optimization, on the other hand, did not appear to suffer from this problem.

8 Conclusion and Future Work

This paper makes two contributions to the problem of learning label sequences. First, we have presented an efficient algorithm for discriminative learning of label sequences that combines boosting with dynamic programming. The algorithm compares favorably with the best previous approach, Conditional Random Fields, and offers additional benefits such as model sparseness. Secondly, we have discussed the use of methods that optimize a label-by-label loss and have shown that these methods bear promise for further improving classification accuracy. Our future work will investigate the performance (in both accuracy and computational expense) of the different loss functions under different conditions (e.g. noise level, size of the feature set).

Acknowledgments

This work was sponsored by an NSF-ITR grant, award number IIS-0085940.

References

[1] M. Collins. Discriminative reranking for natural language parsing.
In Proceedings of the 17th International Conference on Machine Learning, pages 175-182. Morgan Kaufmann, San Francisco, CA, 2000.

[2] M. Collins. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 489-496, 2002.

[3] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[4] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337-374, 2000.

[5] S. Kakade, Y. W. Teh, and S. Roweis. An alternative objective function for Markovian fields. In Proceedings of the 19th International Conference on Machine Learning, 2002.

[6] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA, 2001.

[7] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[8] T. Minka. Algorithms for maximum-likelihood logistic regression. Technical Report TR 758, CMU, Department of Statistics, 2001.

[9] R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.", "award": [], "sourceid": 2333, "authors": [{"given_name": "Yasemin", "family_name": "Altun", "institution": null}, {"given_name": "Thomas", "family_name": "Hofmann", "institution": null}, {"given_name": "Mark", "family_name": "Johnson", "institution": null}]}