{"title": "Semi-Markov Conditional Random Fields for Information Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 1185, "page_last": 1192, "abstract": null, "full_text": "Semi-Markov Conditional Random Fields for Information Extraction

Sunita Sarawagi
Indian Institute of Technology
Bombay, India
sunita@iitb.ac.in

William W. Cohen
Center for Automated Learning & Discovery
Carnegie Mellon University
wcohen@cs.cmu.edu

Abstract

We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semi-CRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements x_i of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time, often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs.

1 Introduction

Conditional random fields (CRFs) are a recently-introduced formalism [12] for representing a conditional model Pr(y|x), where both x and y have non-trivial structure (often sequential). Here we introduce a generalization of sequential CRFs called semi-Markov conditional random fields (or semi-CRFs). Recall that semi-Markov chain models extend hidden Markov models (HMMs) by allowing each state s_i to persist for a non-unit length of time d_i. After this time has elapsed, the system transitions to a new state s′, which depends only on s_i; however, during the “segment” of time between i and i + d_i, the behavior of the system may be non-Markovian.
Generative semi-Markov models are fairly common in certain applications of statistics [8, 9], and are also used in reinforcement learning to model hierarchical Markov decision processes [19].

Semi-CRFs are a conditionally trained version of semi-Markov chains. In this paper, we present inference and learning methods for semi-CRFs. We also argue that segments often have a clear intuitive meaning, and hence semi-CRFs are more natural than conventional CRFs. We focus here on named entity recognition (NER), in which a segment corresponds to an extracted entity; however, similar arguments might be made for several other tasks, such as gene-finding [11] or NP-chunking [16].

In NER, a semi-Markov formulation allows one to easily construct entity-level features (such as “entity length” and “similarity to other known entities”) which cannot be easily encoded in CRFs. Experiments on five different NER problems show that semi-CRFs often outperform conventional CRFs.

2 CRFs and Semi-CRFs

2.1 Definitions

A CRF models Pr(y|x) using a Markov random field, with nodes corresponding to elements of the structured object y, and potential functions that are conditional on (features of) x. Learning is performed by setting parameters to maximize the likelihood of a set of (x, y) pairs given as training data. One common use of CRFs is for sequential learning problems like NP chunking [16], POS tagging [12], and NER [15]. For these problems the Markov field is a chain, and y is a linear sequence of labels from a fixed set Y. For instance, in the NER application, x might be a sequence of words, and y might be a sequence in {I, O}^{|x|}, where y_i = I indicates “word x_i is inside a name” and y_i = O indicates the opposite.

Assume a vector f of local feature functions f = ⟨f^1, . . . , f^K⟩, each of which maps a pair (x, y) and an index i to a measurement f^k(i, x, y) ∈ R.
Let f(i, x, y) be the vector of these measurements, and let F(x, y) = Σ_{i=1}^{|x|} f(i, x, y). For example, in NER, the components of f might include the measurement f^{13}(i, x, y) = [[x_i is capitalized]] · [[y_i = I]], where the indicator function [[c]] = 1 if c is true and zero otherwise; this implies that F^{13}(x, y) would be the number of capitalized words x_i paired with the label I. Following previous work [12, 16] we will define a conditional random field (CRF) to be an estimator of the form

Pr(y|x, W) = (1/Z(x)) e^{W·F(x,y)}    (1)

where W is a weight vector over the components of F, and Z(x) = Σ_{y′} e^{W·F(x,y′)}.

To extend this to the semi-Markov case, let s = ⟨s_1, . . . , s_p⟩ denote a segmentation of x, where segment s_j = ⟨t_j, u_j, y_j⟩ consists of a start position t_j, an end position u_j, and a label y_j ∈ Y. Conceptually, a segment means that the tag y_j is given to all x_i’s between i = t_j and i = u_j, inclusive. We assume segments have positive length, and completely cover the sequence 1 . . . |x| without overlapping: that is, that t_j and u_j always satisfy t_1 = 1, u_p = |x|, 1 ≤ t_j ≤ u_j ≤ |x|, and t_{j+1} = u_j + 1. For NER, a valid segmentation of the sentence “I went skiing with Fernando Pereira in British Columbia” might be s = ⟨(1, 1, O), (2, 2, O), (3, 3, O), (4, 4, O), (5, 6, I), (7, 7, O), (8, 9, I)⟩, corresponding to the label sequence y = ⟨O, O, O, O, I, I, O, I, I⟩.

We now assume a vector g of segment feature functions g = ⟨g^1, . . . , g^K⟩, each of which maps a triple (j, x, s) to a measurement g^k(j, x, s) ∈ R, and define G(x, s) = Σ_{j=1}^{|s|} g(j, x, s). We also make a restriction on the features, analogous to the usual Markovian assumption made in CRFs, and assume that every component g^k of g is a function only of x, s_j, and the label y_{j−1} associated with the preceding segment s_{j−1}.
In other words, we assume that every g^k(j, x, s) can be rewritten as

g^k(j, x, s) = g′^k(y_j, y_{j−1}, x, t_j, u_j)    (2)

for an appropriately defined g′^k. In the rest of the paper, we will drop the g′ notation and use g for both versions of the segment-level feature functions. A semi-CRF is then an estimator of the form

Pr(s|x, W) = (1/Z(x)) e^{W·G(x,s)}    (3)

where again W is a weight vector for G and Z(x) = Σ_{s′} e^{W·G(x,s′)}.

2.2 An efficient inference algorithm

The inference problem for a semi-CRF is defined as follows: given W and x, find the best segmentation, argmax_s Pr(s|x, W), where Pr(s|x, W) is defined by Equation 3. An efficient inference algorithm is suggested by Equation 2, which implies that

argmax_s Pr(s|x, W) = argmax_s W·G(x, s) = argmax_s W·Σ_j g(y_j, y_{j−1}, x, t_j, u_j)

Let L be an upper bound on segment length. Let s_{i:y} denote the set of all partial segmentations starting from 1 (the first index of the sequence) to i, such that the last segment has the label y and ending position i. Let V_{x,g,W}(i, y) denote the largest value of W·G(x, s′) for any s′ ∈ s_{i:y}. Omitting the subscripts, the following recursive calculation implements a semi-Markov analog of the usual Viterbi algorithm:

V(i, y) = max_{y′, d=1...L} V(i−d, y′) + W·g(y, y′, x, i−d+1, i)   if i > 0
V(i, y) = 0   if i = 0
V(i, y) = −∞   if i < 0    (4)

The best segmentation then corresponds to the path traced by max_y V(|x|, y).

2.3 Semi-Markov CRFs vs order-L CRFs

Since conventional CRFs need not maximize over possible segment lengths d, inference for semi-CRFs is more expensive. However, Equation 4 shows that the additional cost is only linear in L.
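Recursion (4) translates directly into code. The following is a minimal sketch, not the authors' implementation: a user-supplied score(y, y_prev, x, t, u) stands in for W·g(y, y′, x, t, u), segments are represented as half-open 0-based spans [t, u), and a dummy start label plays the role of the i = 0 base case.

```python
# Sketch of the semi-Markov Viterbi recursion of Equation 4.
# `score(y, y_prev, x, t, u)` stands in for W·g(y, y', x, t, u).

def semimarkov_viterbi(x, labels, L, score, start="<s>"):
    n = len(x)
    # V[i][y] = best score of a segmentation of x[0:i] whose last segment
    # ends at i with label y; back[i][y] remembers (prev_label, seg_start).
    V = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{} for _ in range(n + 1)]
    V[0] = {start: 0.0}  # base case V(0, y) = 0
    for i in range(1, n + 1):
        for y in labels:
            for d in range(1, min(L, i) + 1):  # segment lengths 1..L
                for y_prev, prev_score in V[i - d].items():
                    s = prev_score + score(y, y_prev, x, i - d, i)
                    if s > V[i][y]:
                        V[i][y] = s
                        back[i][y] = (y_prev, i - d)
    # Trace back the best path from max_y V(n, y).
    y = max(V[n], key=V[n].get)
    segments, i = [], n
    while i > 0:
        y_prev, t = back[i][y]
        segments.append((t, i, y))  # half-open segment [t, i) labeled y
        y, i = y_prev, t
    return list(reversed(segments))
```

The inner loops make the L·|Y| cost per position explicit: each ending position i considers every segment length d up to L and every previous label.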
For NER, a reasonable value of L might be four or five.¹ Since in the worst case L ≤ |x|, the semi-Markov Viterbi algorithm is always polynomial, even when L is unbounded.

For fixed L, it can be shown that semi-CRFs are no more expressive than order-L CRFs. For order-L CRFs, however, the additional computational cost is exponential in L. The difference is that semi-CRFs only consider sequences in which the same label is assigned to all L positions, rather than all |Y|^L length-L sequences. This is a useful restriction, as it leads to faster inference.

Semi-CRFs are also a natural restriction, as it is often convenient to express features in terms of segments. As an example, let d_j denote the length of a segment, and let µ be the average length of all segments with label I. Now consider the segment feature g^{k1}(j, x, s) = (d_j − µ)² · [[y_j = I]]. After training, the contribution of this feature toward Pr(s|x) associated with a length-d entity will be proportional to e^{w_{k1}·(d−µ)²}, i.e., it allows the learner to model a Gaussian distribution of entity lengths.

An exponential model for lengths could be implemented with the feature g^{k2}(j, x, s) = d_j · [[y_j = I]]. In contrast to the Gaussian-length feature above, g^{k2} is “equivalent to” a local feature function f(i, x, y) = [[y_i = I]], in the following sense: for every triple x, y, s, where y is the tags for s, Σ_j g^{k2}(j, x, s) = Σ_i f(i, x, y). Thus a semi-CRF model based on the single feature g^{k2} could also be represented by a conventional CRF.

In general, a semi-CRF model can be factorized in terms of an equivalent order-1 CRF model if and only if the sum of the segment features can be rewritten as a sum of local features. Thus the degree to which semi-CRFs are non-Markovian depends on the feature set.

2.4 Learning algorithm

During training the goal is to maximize log-likelihood over a given training set T = {(x_ℓ, s_ℓ)}_{ℓ=1}^N. Following the notation of Sha and Pereira [16], we express the log-likelihood over the training sequences as

L(W) = Σ_ℓ log Pr(s_ℓ|x_ℓ, W) = Σ_ℓ (W·G(x_ℓ, s_ℓ) − log Z_W(x_ℓ))    (5)

¹Assuming that non-entity words are placed in unit-length segments, as we do below.

We wish to find a W that maximizes L(W). Equation 5 is convex, and can thus be maximized by gradient ascent, or one of many related methods. (In our implementation we use a limited-memory quasi-Newton method [13, 14].) The gradient of L(W) is the following:

∇L(W) = Σ_ℓ ( G(x_ℓ, s_ℓ) − [Σ_{s′} G(s′, x_ℓ) e^{W·G(x_ℓ,s′)}] / Z_W(x_ℓ) )    (6)
       = Σ_ℓ ( G(x_ℓ, s_ℓ) − E_{Pr(s′|W)} G(x_ℓ, s′) )    (7)

The first set of terms is easy to compute. However, to compute the normalizer Z_W(x_ℓ) and the expected value of the features under the current weight vector, E_{Pr(s′|W)} G(x_ℓ, s′), we must use the Markov property of G to construct a dynamic programming algorithm, similar to forward-backward. We thus define α(i, y) as the value of Σ_{s′∈s_{i:y}} e^{W·G(s′,x)}, where again s_{i:y} denotes all segmentations from 1 to i ending at i and labeled y. For i > 0, this can be expressed recursively as

α(i, y) = Σ_{d=1}^{L} Σ_{y′∈Y} α(i−d, y′) e^{W·g(y,y′,x,i−d+1,i)}

with the base cases defined as α(0, y) = 1 and α(i, y) = 0 for i < 0. The value of Z_W(x) can then be written as Z_W(x) = Σ_y α(|x|, y).

A similar approach can be used to compute the expectation Σ_{s′} G(x_ℓ, s′) e^{W·G(x_ℓ,s′)}. For the k-th component of G, let η^k(i, y) be the value of the sum Σ_{s′∈s_{i:y}} G^k(s′, x_ℓ) e^{W·G(x_ℓ,s′)}, restricted to the part of the segmentation ending at position i. The following recursion² can then be used to compute η^k(i, y):

η^k(i, y) = Σ_{d=1}^{L} Σ_{y′∈Y} (η^k(i−d, y′) + α(i−d, y′) g^k(y, y′, x, i−d+1, i)) e^{W·g(y,y′,x,i−d+1,i)}

Finally we let E_{Pr(s′|W)} G^k(s′, x) = (1/Z_W(x)) Σ_y η^k(|x|, y).

3 Experiments with NER data

3.1 Baseline algorithms and datasets

In our experiments, we trained semi-CRFs to mark entity segments with the label I, and put non-entity words into unit-length segments with label O. We compared this with two versions of CRFs. The first version, which we call CRF/1, labels words inside and outside entities with I and O, respectively. The second version, called CRF/4, replaces the I tag with four tags B, E, C, and U, which depend on where the word appears in an entity [2].

We compared the algorithms on five NER problems, associated with three different corpora. The Address corpus contains 4,226 words, and consists of 395 home addresses of students in a major university in India [1]. We considered extraction of city names and state names from this corpus. The Jobs corpus contains 73,330 words, and consists of 300 computer-related job postings [4]. We considered extraction of company names and job titles. The 18,121-word Email corpus contains 216 email messages taken from the CSPACE email corpus [10], which is mail associated with a 14-week, 277-person management game. Here we considered extraction of person names.

²As in the forward-backward algorithm for chain CRFs [16], space requirements here can be reduced from ML|Y| to M|Y|, where M is the length of the sequence, by pre-computing an appropriate set of β values.

3.2 Features

As features for CRF, we used indicators for specific words at location i, or locations within three words of i.
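Word-indicator features of this kind can be sketched as follows; the "w@offset=word" naming scheme is our own illustration, not taken from the paper.

```python
# Sketch of word-indicator features for position i: one indicator for the
# word at i and one per word within a three-word window around i.
# The "w@offset=word" feature names are illustrative only.

def word_window_features(x, i, window=3):
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(x):  # skip offsets that fall outside the sequence
            feats.append(f"w@{offset}={x[j].lower()}")
    return feats
```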
Following previous NER work [7], we also used indicators for capitalization/letter patterns (such as “Aa+” for a capitalized word, or “D” for a single-digit number).

As features for semi-CRFs, we used the same set of word-level features, as well as their logical extensions to segments. Specifically, we used indicators for the phrase inside a segment and the capitalization pattern inside a segment, as well as indicators for words and capitalization patterns in 3-word windows before and after the segment. We also used indicators for each segment length (d = 1, . . . , L), and combined all word-level features with indicators for the beginning and end of a segment.

To exploit more of the power of semi-CRFs, we also implemented a number of dictionary-derived features, each of which was based on a different dictionary D and similarity function sim. Letting x_{s_j} denote the subsequence ⟨x_{t_j} . . . x_{u_j}⟩, a dictionary feature is defined as g^{D,sim}(j, x, s) = max_{u∈D} sim(x_{s_j}, u), i.e., the similarity between the word sequence x_{s_j} and the closest element in D.

For each of the extraction problems, we assembled one external dictionary of strings, which were similar (but not identical) to the entity names in the documents. For instance, for city names in the Address data, we used a web page listing cities in India. Due to variations in the way entity names are written, rote matching these dictionaries to the data gives relatively low F1 values, ranging from 22% (for the job-title extraction task) to 57% (for the person-name task). We used three different similarity metrics (Jaccard, TFIDF, and JaroWinkler) which are known to work well for name-matching in data integration tasks [5]. All of the distance metrics are non-Markovian, i.e., the distance-based segment features cannot be decomposed into sums of local features.
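A token-level Jaccard version of such a segment feature can be sketched as follows; the simplified sim here is purely illustrative, standing in for the string-similarity implementations of [5].

```python
# Sketch of a dictionary-derived segment feature: the similarity between a
# segment's text and its closest entry in a dictionary D. Token-level
# Jaccard is a simple stand-in for the metrics of [5].

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dictionary_feature(x, t, u, dictionary):
    """max over w in D of sim(segment, w), for the segment x[t:u]."""
    segment = " ".join(x[t:u])
    return max((jaccard(segment, entry) for entry in dictionary), default=0.0)
```

Note that the feature depends on the whole span x[t:u] at once, which is why it has no decomposition into per-word local features.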
More detail on the distance metrics, feature sets, and datasets above can be found elsewhere [6].

We also extended the semi-CRF algorithm to construct, on the fly, an internal segment dictionary of segments labeled as entities in the training data. To make measurements on training data similar to those on test data, when finding the closest neighbor of x_{s_j} in the internal dictionary, we excluded all strings formed from x, thus excluding matches of x_{s_j} to itself (or subsequences of itself). This feature could be viewed as a sort of nearest-neighbor classifier; in this interpretation the semi-CRF is performing a sort of bi-level stacking [21].

For completeness in the experiments, we also evaluated local versions of the dictionary features. Specifically, we constructed dictionary features of the form f^{D,sim}(i, x, y) = max_{u∈D} sim(x_i, u), where D is either the external dictionary used above, or an internal word dictionary formed from all words contained in entities. As before, words in x were excluded in finding near neighbors to x_i.

3.3 Results and Discussion

We evaluated F1-measure performance³ of CRF/1, CRF/4, and semi-CRFs, with and without internal and external dictionaries. A detailed tabulation of the results is shown in Table 1, and Figure 1 shows F1 values plotted against training set size for a subset of three of the tasks, and four of the learning methods. In each experiment performance was averaged over seven runs, and evaluation was performed on a hold-out set of 30% of the documents. In the table the learners are trained with 10% of the available data; as the curves show, performance differences are often smaller with more training data. Gaussian priors were used for all algorithms, and for semi-CRFs, a fixed value of L was chosen for each dataset based on observed entity lengths.
This ranged between 4 and 6 for the different datasets.

[Figure 1 appears here: three panels (Address_State, Address_City, Email_Person) plotting span F1 accuracy against the fraction of available training data, from 0.05 to 0.5, for CRF/4, SemiCRF+int, CRF/4+dict, and SemiCRF+int+dict.]

Figure 1: F1 as a function of training set size. Algorithms marked with “+dict” include external dictionary features, and algorithms marked with “+int” include internal dictionary features. We do not use internal dictionary features for CRF/4 since they lead to reduced accuracy.

            baseline   +internal dict     +external dict     +both dictionaries
            F1         F1      Δbase      F1      Δbase      F1      Δbase   Δextern
CRF/1
  state     20.8       44.5    113.9      69.2    232.7      55.2    165.4    -67.3
  title     28.5        3.8    -86.7      38.6     35.4      19.9    -30.2    -65.6
  person    67.6       48.0    -29.0      81.4     20.4      64.7     -4.3    -24.7
  city      70.3       60.0    -14.7      80.4     14.4      69.8     -0.7    -15.1
  company   51.4       16.5    -67.9      55.3      7.6      15.6    -69.6    -77.2
CRF/4
  state     15.0       25.4     69.3      46.8    212.0      43.1    187.3    -24.7
  title     23.7        7.9    -66.7      36.4     53.6      14.6    -38.4    -92.0
  person    70.9       64.5     -9.0      82.5     16.4      74.8      5.5    -10.9
  city      73.2       70.6     -3.6      80.8     10.4      76.3      4.2     -6.1
  company   54.8       20.6    -62.4      61.2     11.7      25.1    -54.2    -65.9
semi-CRF
  state     25.6       35.5     38.7      62.7    144.9      65.2    154.7      9.8
  title     33.8       37.5     10.9      41.1     21.5      40.2     18.9     -2.5
  person    72.2       74.8      3.6      82.8     14.7      83.7     15.9      1.2
  city      75.9       75.3     -0.8      84.0     10.7      83.6     10.1     -0.5
  company   60.2       59.7     -0.8      60.9      1.2      60.9      1.2      0.0

Table 1: Comparing various methods on five IE tasks, with and without dictionary features. The column Δbase is the percentage change in F1 relative to the baseline. The column Δextern is the change relative to using only external-dictionary features.

In the baseline configuration in which no dictionary features are used, semi-CRFs perform best on all five of the tasks. When internal dictionary features are used, the performance of semi-CRFs is often improved, and never degraded by more than 2.5%. However, the less-natural local version of these features often leads to substantial performance losses for CRF/1 and CRF/4. Semi-CRFs perform best on nine of the ten task variants for which internal dictionaries were used. The external-dictionary features are helpful to all the algorithms.

³F1 is defined as 2·precision·recall/(precision+recall).
Semi-CRFs perform best on three of the five tasks in which only external dictionaries were used.

Overall, semi-CRFs perform quite well. If we consider the tasks with and without external dictionary features as separate “conditions”, then semi-CRFs using all available information⁴ outperform both CRF variants on eight of ten “conditions”.

We also compared semi-CRFs to order-L CRFs, with various values of L.⁵ In Table 2 we show the results for L = 1, L = 2, and L = 3, compared to semi-CRF. For these tasks, the performance of CRF/4 and CRF/1 does not seem to improve much by simply increasing order.

⁴I.e., the both-dictionary version when external dictionaries are available, and the internal-dictionary-only version otherwise.

⁵Order-L CRFs were implemented by replacing the label set Y with Y^L. We limited experiments to L ≤ 3 for computational reasons.

                       CRF/1                    CRF/4             semi-CRF
                 L = 1  L = 2  L = 3     L = 1  L = 2  L = 3
Address State     20.8   19.2   20.1      15.0   16.4   16.4        25.6
Address City      70.3   71.2   71.0      73.2   73.7   73.9        75.9
Email persons     67.6   66.7   63.7      70.9   70.4   70.7        72.2

Table 2: F1 values for different order CRFs

4 Related work

Semi-CRFs are similar to nested HMMs [1], which can also be trained discriminatively [17]. The primary difference is that the “inner model” for semi-CRFs is of short, uniformly-labeled segments with non-Markovian properties, while nested HMMs allow longer, diversely-labeled, Markovian “segments”.

Discriminative learning methods can be used for conditional random fields with architectures more complex than chains (e.g., [20, 18]), and one of these methods has also been applied to NER [3].
Further, by creating a random variable for each possible segment, one can learn models strictly more expressive than the semi-Markov models described here. However, for these methods, inference is not tractable, and hence approximations must be made in training and classification. An interesting question for future research is whether the tractable extension to CRF inference considered here can be used to improve inference methods for more expressive models.

In recent prior work [6], we investigated semi-Markov learning methods for NER based on a voted perceptron training algorithm [7]. The voted perceptron has some advantages in ease of implementation and efficiency. (In particular, the voted perceptron algorithm is more stable numerically, as it does not need to compute a partition function.) However, semi-CRFs perform somewhat better, on average, than our perceptron-based learning algorithm. Probabilistically-grounded approaches like CRFs are also preferable to margin-based approaches like the voted perceptron in certain settings, e.g., when it is necessary to estimate confidences in a classification.

5 Concluding Remarks

Semi-CRFs are a tractable extension of CRFs that offer much of the power of higher-order models without the associated computational cost. A major advantage of semi-CRFs is that they allow features which measure properties of segments, rather than individual elements. For applications like NER and gene-finding [11], these features can be quite natural.

Appendix

An implementation of semi-CRFs is available at http://crf.sourceforge.net, and an NER package based on it is available at http://minorthird.sourceforge.net.

References

[1] V. R. Borkar, K. Deshmukh, and S. Sarawagi. Automatic text segmentation for extracting structured records. In Proc. ACM SIGMOD International Conf. on Management of Data, Santa Barbara, USA, 2001.

[2] A. Borthwick, J. Sterling, E.
Agichtein, and R. Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Sixth Workshop on Very Large Corpora, New Brunswick, New Jersey, 1998.

[3] R. Bunescu and R. J. Mooney. Relational Markov networks for collective information extraction. In Proceedings of the ICML-2004 Workshop on Statistical Relational Learning (SRL-2004), Banff, Canada, July 2004.

[4] M. E. Califf and R. J. Mooney. Bottom-up relational learning of pattern matching rules for information extraction. Journal of Machine Learning Research, 4:177–210, 2003.

[5] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003.

[6] W. W. Cohen and S. Sarawagi. Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.

[7] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Empirical Methods in Natural Language Processing (EMNLP), 2002.

[8] X. Ge. Segmental Semi-Markov Models and Applications to Sequence Analysis. PhD thesis, University of California, Irvine, December 2002.

[9] J. Janssen and N. Limnios. Semi-Markov Models and Applications. Kluwer Academic, 1999.

[10] R. E. Kraut, S. R. Fussell, F. J. Lerch, and J. A. Espinosa. Coordination in teams: evidence from a simulated management game. To appear in the Journal of Organizational Behavior, 2005.

[11] A. Krogh. Gene finding: putting the parts together. In M. J. Bishop, editor, Guide to Human Genome Computing, pages 261–274. Academic Press, 2nd edition, 1998.

[12] J. Lafferty, A. McCallum, and F.
Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML-2001), Williams, MA, 2001.

[13] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large-scale optimization. Mathematical Programming, 45:503–528, 1989.

[14] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of The Sixth Conference on Natural Language Learning (CoNLL-2002), pages 49–55, 2002.

[15] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of The Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, 2003.

[16] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL, 2003.

[17] M. Skounakis, M. Craven, and S. Ray. Hierarchical hidden Markov models for information extraction. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico. Morgan Kaufmann, 2003.

[18] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In ICML, 2004.

[19] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[20] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), Edmonton, Canada, 2002.

[21] D. H. Wolpert. Stacked generalization.
Neural Networks, 5:241–259, 1992.", "award": [], "sourceid": 2648, "authors": [{"given_name": "Sunita", "family_name": "Sarawagi", "institution": null}, {"given_name": "William", "family_name": "Cohen", "institution": null}]}