{"title": "Adaptive Mixture of Probabilistic Transducers", "book": "Advances in Neural Information Processing Systems", "page_first": 381, "page_last": 387, "abstract": null, "full_text": "Adaptive Mixture of Probabilistic Transducers \n\nYoram Singer \n\nAT&T Bell Laboratories \nsinger@research.att.com \n\nAbstract \n\nWe introduce and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each model in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best model from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer. \n\n1 Introduction \n\nSupervised learning of a probabilistic mapping between temporal sequences is an important goal of natural sequence analysis and classification, with a broad range of applications such as handwriting and speech recognition, natural language processing and DNA analysis. Research efforts in supervised learning of probabilistic mappings have been almost exclusively focused on estimating the parameters of a predefined model. For example, in [5] a second order recurrent neural network was used to induce a finite state automaton that classifies input sequences, and in [1] an input-output HMM architecture was used for similar tasks. \n\nIn this paper we introduce and analyze an alternative approach based on a mixture model of a new subclass of probabilistic transducers, which we call suffix tree transducers. The mixture of experts architecture has proved to be a powerful approach both theoretically and experimentally. See [4, 8, 6, 10, 2, 7] for analyses and applications of mixture models, from different perspectives such as connectionism, Bayesian inference and computational learning theory. 
By combining techniques used for compression [13] and unsupervised learning [12], we devise an online algorithm that efficiently updates the mixture weights and the parameters of all the possible models from an arbitrarily large (possibly infinite) pool of suffix tree transducers. Moreover, we apply the mixture estimation paradigm to the estimation of the parameters of each model in the pool and achieve an efficient estimate of the free parameters of each model. We present theoretical analysis, simulations and experiments with real data which show that the learning algorithm indeed tracks the best model in a growing pool of models, yielding an accurate approximation of the source. All proofs are omitted due to lack of space. \n\n2 Mixture of Suffix Tree Transducers \n\nLet Σ_in and Σ_out be two finite alphabets. A suffix tree transducer T over (Σ_in, Σ_out) is a rooted |Σ_in|-ary tree where every internal node of T has one child for each symbol in Σ_in. The nodes of the tree are labeled by pairs (s, γ_s), where s is the string associated with the path (sequence of symbols in Σ_in) that leads from the root to that node, and γ_s : Σ_out → [0, 1] is the output probability function. A suffix tree transducer (stochastically) maps arbitrarily long input sequences over Σ_in to output sequences over Σ_out as follows. The probability that T will output a string y_1, y_2, ..., y_n in Σ_out^n given an input string x_1, x_2, ..., x_n in Σ_in^n, denoted by P_T(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n), is ∏_{k=1}^n γ_{s^k}(y_k), where s^1 = x_1 and, for 2 ≤ j ≤ n, s^j is the string labeling the deepest node reached by taking the path corresponding to x_j, x_{j-1}, x_{j-2}, ... starting at the root of T. A suffix tree transducer is therefore a probabilistic mapping that induces a measure over the possible output strings given an input string. Examples of suffix tree transducers are given in Fig. 1. 
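The mapping just defined can be sketched in a few lines of Python. This is an illustrative sketch only (the `Node` class, dictionary representation and probability values are assumptions for the example, not the paper's data structure): an input history is walked backwards from the root to the deepest matching node, whose output distribution supplies the factor for that position.

```python
# Minimal sketch of a suffix tree transducer (names and layout illustrative).

class Node:
    def __init__(self, gamma):
        self.gamma = gamma          # output distribution: symbol -> probability
        self.children = {}          # input symbol -> child Node

def deepest_node(root, history):
    """Walk x_n, x_{n-1}, ... from the root; return the deepest existing node."""
    node = root
    for sym in reversed(history):
        if sym not in node.children:
            break
        node = node.children[sym]
    return node

def sequence_probability(root, xs, ys):
    """P_T(y_1..y_n | x_1..x_n) = prod_k gamma_{s^k}(y_k)."""
    p = 1.0
    for k in range(len(xs)):
        node = deepest_node(root, xs[:k + 1])
        p *= node.gamma.get(ys[k], 0.0)
    return p

# A depth-1 toy transducer in the style of Fig. 1 (made-up numbers):
root = Node({"a": 0.5, "b": 0.25, "c": 0.25})
root.children["0"] = Node({"a": 0.5, "b": 0.2, "c": 0.3})
root.children["1"] = Node({"a": 0.6, "b": 0.3, "c": 0.1})
```

With this toy tree, the input "0" then "1" routes the two prediction steps to the "0" and "1" children respectively, so the probability of outputting "a", "a" is 0.5 · 0.6.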
\n\nFigure 1: A suffix tree transducer (left) over (Σ_in, Σ_out) = ({0, 1}, {a, b, c}) and two of its possible sub-models (subtrees). The strings labeling the nodes are the suffixes of the input string used to predict the output string. At each node there is an output probability function defined for each of the possible output symbols. For instance, using the suffix tree transducer depicted on the left, the probability of observing the symbol b given that the input sequence is ..., 0, 1, 0, is 0.1. The probability of the current output, when each transducer is associated with a weight (prior), is the weighted sum of the predictions of each transducer. For example, assume that the weights of the trees are 0.7 (left tree), 0.2 (middle), and 0.1. Then the probability that the output y_n = a given that (x_{n-2}, x_{n-1}, x_n) = (0, 1, 0) is 0.7 · P_{T_1}(a|010) + 0.2 · P_{T_2}(a|10) + 0.1 · P_{T_3}(a|0) = 0.7 · 0.8 + 0.2 · 0.7 + 0.1 · 0.5 = 0.75. \n\nGiven a suffix tree transducer T we are interested in the prediction of the mixture of all possible subtrees of T. We associate with each subtree (including T itself) a weight which can be interpreted as its prior probability. We later show how the learning algorithm of a mixture of suffix tree transducers adapts these weights in accordance with the performance (the evidence, in Bayesian terms) of each subtree on past observations. Direct calculation of the mixture probability is infeasible since there might be exponentially many such subtrees. However, the technique introduced in [13] can be generalized and applied to our setting. Let T' be a subtree of T. Denote by n_1 the number of internal nodes of T' and by n_2 the number of leaves of T' which are not leaves of T. For example, n_1 = 2 and n_2 = 1 for the tree depicted on the right part of Fig. 1, assuming that T is the tree depicted on the left part of the figure. The prior weight of a tree T', 
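The prior weight (1 - α)^{n_1} α^{n_2} defined in the text can be checked numerically on the smallest non-trivial case. A minimal sketch (the function name is illustrative); it also reproduces the n_1 = 2, n_2 = 1 example from Fig. 1 with α = 1/2:

```python
# P_0(T') = (1 - alpha)**n1 * alpha**n2, where n1 counts internal nodes of T'
# and n2 counts leaves of T' that are not leaves of T.
def prior_weight(n1, n2, alpha):
    return (1 - alpha) ** n1 * alpha ** n2

# Smallest non-trivial case: T is a root whose children are all leaves of T.
# Sub(T) then has two members: the root alone (n1 = 0, n2 = 1) and T itself
# (n1 = 1, n2 = 0, since its leaves are leaves of T).  The weights sum to
# alpha + (1 - alpha) = 1 for any alpha in (0, 1), as the proper-measure
# property requires.
alpha = 0.3
total = prior_weight(0, 1, alpha) + prior_weight(1, 0, alpha)
```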
denoted by P_0(T'), is defined to be (1 - α)^{n_1} α^{n_2}, where α ∈ (0, 1). Denote by Sub(T) the set of all possible subtrees of T, including T itself. It can be easily verified that this definition of the weights is a proper measure, i.e., ∑_{T' ∈ Sub(T)} P_0(T') = 1. This distribution over trees can be extended to unbounded trees by assuming that the largest tree is an infinite |Σ_in|-ary suffix tree transducer and using the following randomized recursive process. We start with a suffix tree that includes only the root node. With probability α we stop the process, and with probability 1 - α we add all the possible |Σ_in| sons of the node and continue the process recursively for each of the sons. Using this recursive prior over suffix tree transducers, we can calculate the prediction of the mixture at step n in time that is linear in n, as follows, \n\nα γ_ε(y_n) + (1 - α) (α γ_{x_n}(y_n) + (1 - α) (α γ_{x_{n-1} x_n}(y_n) + (1 - α) ... \n\nTherefore, the prediction time of a single symbol is bounded by the maximal depth of T, or the length of the input sequence if T is infinite. Denote by γ̄_s(y_n) the prediction of the mixture of subtrees rooted at s, and let Leaves(T) be the set of leaves of T. The above sum equals γ̄_ε(y_n), and can be evaluated recursively as follows,¹ \n\nγ̄_s(y_n) = γ_s(y_n) if s ∈ Leaves(T); γ̄_s(y_n) = α γ_s(y_n) + (1 - α) γ̄_{x_{n-|s|} s}(y_n) otherwise.   (1) \n\nFor example, given that the input sequence is ..., 0, 1, 1, 0, the probabilities of the mixtures of subtrees for the tree depicted on the left part of Fig. 1, for y_n = b and given that α = 1/2, are: γ̄_110(b) = 0.4, γ̄_10(b) = 0.5 · γ_10(b) + 0.5 · 0.4 = 0.3, γ̄_0(b) = 0.5 · γ_0(b) + 0.5 · 0.3 = 0.25, γ̄_ε(b) = 0.5 · γ_ε(b) + 0.5 · 0.25 = 0.25. \n\n3 An Online Learning Algorithm \n\nWe now describe an efficient learning algorithm for a mixture of suffix tree transducers. 
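The recursion of Equ. (1) can be sketched directly: the mixture prediction at a node blends its own output probability (weight α) with the prediction of the sub-mixture below it along the current input suffix (weight 1 - α). The node representation below is illustrative, not the paper's:

```python
# Sketch of Equ. (1): gammabar_s(y) computed recursively along the path that
# matches the current input suffix (history[-1] is x_{n-|s|}, the next symbol
# backwards in time).

class Node:
    def __init__(self, gamma):
        self.gamma = gamma      # output distribution at this node
        self.children = {}      # input symbol -> child Node

def mixture_prediction(node, history, y, alpha):
    """gammabar_s(y): the node's own prediction mixed with the subtree below."""
    if not history or history[-1] not in node.children:
        return node.gamma.get(y, 0.0)            # s is a leaf along this path
    child = node.children[history[-1]]
    below = mixture_prediction(child, history[:-1], y, alpha)
    return alpha * node.gamma.get(y, 0.0) + (1 - alpha) * below

# Tiny two-node example (made-up gammas) mirroring the gammabar computations
# in the text, with alpha = 1/2:
leaf = Node({"b": 0.3})
root = Node({"b": 0.2})
root.children["0"] = leaf
```

Here the mixture prediction at the root for y = b after input "0" is 0.5 · 0.2 + 0.5 · 0.3 = 0.25, and an empty history falls back to the root's own γ_ε(b) = 0.2.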
\nThe learning algorithm uses the recursive priors and the evidence to efficiently update the posterior weight of each possible subtree. In this section we assume that the output probability functions are known. Hence, we need to evaluate the following, \n\n∑_{T' ∈ Sub(T)} P(y_n|T') P(T' | (x_1, y_1), ..., (x_{n-1}, y_{n-1})) ≝ ∑_{T' ∈ Sub(T)} P(y_n|T') P_n(T'),   (2) \n\nwhere P_n(T') is the posterior weight of T'. Direct calculation of the above sum requires exponential time. However, using the idea of recursive calculation as in Equ. (1), we can efficiently calculate the prediction of the mixture. Similar to the definition of the recursive prior α, we define q_n(s) to be the posterior weight of a node s compared to the mixture of all nodes below s. We can compute the prediction of the mixture of suffix tree transducers rooted at s by simply replacing the prior weight α with the posterior weight q_{n-1}(s), as follows, \n\nγ̄_s(y_n) = γ_s(y_n) if s ∈ Leaves(T); γ̄_s(y_n) = q_{n-1}(s) γ_s(y_n) + (1 - q_{n-1}(s)) γ̄_{x_{n-|s|} s}(y_n) otherwise.   (3) \n\nIn order to update q_n(s) we introduce one more variable, denoted by r_n(s). Setting r_0(s) = log(α/(1 - α)) for all s, r_n(s) is updated as follows, \n\nr_n(s) = r_{n-1}(s) + log(γ_s(y_n)) - log(γ̄_{x_{n-|s|} s}(y_n)).   (4) \n\nTherefore, r_n(s) is the log-likelihood ratio between the prediction of s and the prediction of the mixture of all nodes below s in T. The new posterior weights q_n(s) are calculated from r_n(s), \n\nq_n(s) = 1 / (1 + e^{-r_n(s)}).   (5) \n\nIn summary, for each new observation pair, we traverse the tree by following the path that corresponds to the input sequence x_n x_{n-1} x_{n-2} ... The predictions of each sub-mixture are calculated using Equ. (3). Given these predictions, the posterior weights of each sub-mixture are updated using Equ. (4) and Equ. (5). Finally, the probability of y_n induced by the whole mixture is the prediction propagated out of the root node, as stated by Lemma 3.1. 
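The per-observation traversal described in the summary can be sketched as follows. This is an illustrative sketch (node layout and function names are assumptions): each node keeps the log-odds r(s) of its own prediction against the sub-mixture below it, the posterior weight q(s) is the sigmoid of r(s), and the update must use q_{n-1}(s) before r is adjusted.

```python
# Sketch of Equ. (3)-(5): prediction with the previous posterior weight,
# then the log-likelihood-ratio update of r(s).
import math

class Node:
    def __init__(self, gamma, alpha):
        self.gamma = gamma
        self.children = {}
        self.r = math.log(alpha / (1 - alpha))   # r_0(s) = log(alpha/(1-alpha))

    def q(self):
        return 1.0 / (1.0 + math.exp(-self.r))   # Equ. (5)

def predict_and_update(node, history, y):
    """Return gammabar_s(y) via Equ. (3) and apply the update of Equ. (4)."""
    own = node.gamma.get(y, 0.0)
    if not history or history[-1] not in node.children:
        return own                                # leaf along this path
    below = predict_and_update(node.children[history[-1]], history[:-1], y)
    q = node.q()                                  # q_{n-1}(s), read BEFORE update
    node.r += math.log(own) - math.log(below)     # Equ. (4)
    return q * own + (1 - q) * below              # Equ. (3)
```

With α = 1/2 the initial r is 0 and q is 1/2, so the first prediction coincides with the prior mixture of Equ. (1); afterwards r drifts towards whichever of the node or its sub-mixture predicts better.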
\n\nLemma 3.1  ∑_{T' ∈ Sub(T)} P(y_n|T') P_n(T') = γ̄_ε(y_n). \n\n¹ A similar derivation still holds even if there is a different prior α_s at each node s of T. For the sake of simplicity we assume that α is constant. \n\nLet Loss_n(T) be the logarithmic loss (negative log-likelihood) of a suffix tree transducer T after n input-output pairs. That is, Loss_n(T) = ∑_{i=1}^n -log(P(y_i|T)). Similarly, the loss of the mixture is defined to be Loss_n^mix = ∑_{i=1}^n -log(γ̄_ε(y_i)). The advantage of using a mixture of suffix tree transducers over a single suffix tree is due to the robustness of the solution, in the sense that the prediction of the mixture is almost as good as the prediction of the best suffix tree in the mixture. \n\nTheorem 1  Let T be a (possibly infinite) suffix tree transducer, and let (x_1, y_1), ..., (x_n, y_n) be any possible sequence of input-output pairs. The loss of the mixture is at most Loss_n(T') - log(P_0(T')) for each possible subtree T'. The running time of the algorithm is Dn, where D is the maximal depth of T, or n² when T is infinite. \n\nThe proof is based on a technique introduced in [4]. Note that the additional loss is constant; hence the normalized additional loss per observation pair is -log(P_0(T'))/n, which decreases like O(1/n). Given a long sequence of input-output pairs, or many short sequences, the structure of the suffix tree transducer is inferred as well. This is done by updating the output functions, as described in the next section, while adding new branches to the tree whenever the suffix of the input sequence does not appear in the current tree. The update of the weights, the parameters, and the structure ends when the maximal depth is reached, or when the beginning of the input sequence is encountered. \n\n4 Parameter Estimation \n\nIn this section we describe how the output probability functions are estimated. Again, we devise an online scheme. 
Denote by C_s^n(y) the number of times the output symbol y was observed out of the n times the node s was visited. A commonly used estimator smoothes each count by adding a constant ε as follows, \n\nγ̂_s^n(y) = (C_s^n(y) + ε) / (n + ε |Σ_out|).   (6) \n\nThe special case of ε = 1/2 is termed Laplace's modified rule of succession, or the add-1/2 estimator. In [9], Krichevsky and Trofimov proved that the add-1/2 estimator, when applied sequentially, has a bounded logarithmic loss compared to the best (maximum-likelihood) estimator calculated after observing the entire input-output sequence. The additional loss of the estimator after n observations is 1/2 (|Σ_out| - 1) log(n) + |Σ_out| - 1. When the output alphabet Σ_out is rather small, we approximate γ_s(y) by γ̂_s(y) using Equ. (6) and increment the count of the corresponding symbol every time the node s is visited. We predict by replacing γ with its estimate γ̂ in Equ. (3). The loss of the mixture with estimated output probability functions, compared to any subtree T' with known parameters, is now bounded as follows, \n\nLoss_n^mix ≤ Loss_n(T') - log(P_0(T')) + 1/2 |T'| (|Σ_out| - 1) log(n/|T'|) + |T'| (|Σ_out| - 1), \n\nwhere |T'| is the number of leaves in T'. This bound is obtained by combining the bound on the prediction of the mixture from Thm. 1 with the loss of the smoothed estimator while applying Jensen's inequality [3]. \n\nWhen |Σ_out| is fairly large or the sample size is fairly small, the smoothing of the output probabilities is too crude. However, in many real problems, only a small subset of the output alphabet is observed in a given context (a node in the tree). For example, when mapping phonemes to phones [11], for a given sequence of input phonemes the phones that can be pronounced are limited to a few possibilities. Therefore, we would like to devise an estimation scheme that statistically depends on the effective local alphabet and not on the whole alphabet. 
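The sequential use of the add-1/2 estimator at a single node can be sketched in a few lines (the helper name and the toy sequence are illustrative): predict the next symbol from the counts so far, then increment the count.

```python
# Sketch of Equ. (6) with epsilon = 1/2 (the add-1/2 / Krichevsky-Trofimov
# rule), applied sequentially at one node.
from collections import Counter

def add_half(counts, n, y, alphabet_size):
    """gammahat_s^n(y) = (C_s^n(y) + 1/2) / (n + alphabet_size/2)."""
    return (counts[y] + 0.5) / (n + alphabet_size / 2)

counts = Counter()
seq = ["a", "b", "a", "a"]          # toy output sequence at this node
probs = []
for n, y in enumerate(seq):
    probs.append(add_half(counts, n, y, alphabet_size=3))  # predict first...
    counts[y] += 1                                         # ...then count
```

The first prediction is (0 + 1/2)/(0 + 3/2) = 1/3 for every symbol, and at every step the estimates over the full alphabet sum to one.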
Such an estimation scheme can be devised by employing again a mixture of models, one model for each possible subset Σ'_out of Σ_out. Although there are 2^{|Σ_out|} subsets of Σ_out, we next show that if the estimators depend only on the size of each subset then the whole mixture can be maintained in time linear in |Σ_out|. \n\nDenote by γ̂_s^n(y | |Σ'_out| = i) the estimate of γ_s(y) after n observations given that the alphabet Σ'_out is of size i. Using the add-1/2 estimator, γ̂_s^n(y | |Σ'_out| = i) = (C_s^n(y) + 1/2)/(n + i/2). Let Σ_out^n(s) be the set of different output symbols observed at node s, i.e., \n\nΣ_out^n(s) = {u | u = y_k, s = (x_{k-|s|+1}, ..., x_k), 1 ≤ k ≤ n}, \n\nand define Σ_out^0(s) to be the empty set. There are C(|Σ_out| - |Σ_out^n(s)|, i - |Σ_out^n(s)|) possible alphabets of size i that contain the observed symbols. Thus, the prediction of the mixture of all possible subsets of Σ_out is, \n\nγ̂_s^n(y) = ∑_{j=|Σ_out^n(s)|}^{|Σ_out|} C(|Σ_out| - |Σ_out^n(s)|, j - |Σ_out^n(s)|) w_j^n γ̂_s^n(y | j),   (7) \n\nwhere w_j^n is the posterior probability of an alphabet of size j. Evaluation of this sum requires O(|Σ_out|) operations (and not O(2^{|Σ_out|})). We can compute Equ. (7) in an online fashion as follows. Let, \n\nκ_n(i) = C(|Σ_out| - |Σ_out^n(s)|, i - |Σ_out^n(s)|) w_i^0 ∏_{k=1}^n γ̂_s^{k-1}(y_k | i).   (8) \n\nWithout loss of generality, let us assume a uniform prior for the possible alphabet sizes. Then, \n\nP_0(Σ'_out) = P_0(|Σ'_out| = i) ⇒ w_i^0 = 1 / (|Σ_out| · C(|Σ_out|, i)). \n\nThus, for all i, κ_0(i) = 1/|Σ_out|. 
κ_{n+1}(i) is updated from κ_n(i) as follows, \n\nκ_{n+1}(i) = κ_n(i) × { 0 if |Σ_out^{n+1}(s)| > i ; (C_s^n(y_{n+1}) + 1/2)/(n + i/2) if |Σ_out^{n+1}(s)| ≤ i and y_{n+1} ∈ Σ_out^n(s) ; ((i - |Σ_out^n(s)|)/(|Σ_out| - |Σ_out^n(s)|)) · (1/2)/(n + i/2) if |Σ_out^{n+1}(s)| ≤ i and y_{n+1} ∉ Σ_out^n(s) } \n\nInformally: if the number of different symbols observed so far exceeds a given size, then all alphabets of this size are eliminated from the mixture by setting their posterior probability to zero. Otherwise, if the next symbol was observed before, the output probability is the prediction of the add-1/2 estimator. Lastly, if the next symbol is entirely new, we need to sum the predictions of all the alphabets of size i which agree on the first |Σ_out^n(s)| symbols and for which y_{n+1} is one of their i - |Σ_out^n(s)| (yet) unobserved symbols. Furthermore, we need to multiply by the a priori probability of observing y_{n+1}. Assuming a uniform prior over the unobserved symbols, this probability equals 1/(|Σ_out| - |Σ_out^n(s)|). Applying Bayes' rule again, the prediction of the mixture of all possible subsets of the output alphabet is, \n\nγ̂_s(y_{n+1}) = ∑_{i=1}^{|Σ_out|} κ_{n+1}(i) / ∑_{i=1}^{|Σ_out|} κ_n(i).   (9) \n\nApplying the online mixture estimation technique twice, first for the structure and then for the parameters, yields an efficient and robust online algorithm. For a sample of size n, the time complexity of the algorithm is D|Σ_out|n (or |Σ_out|n² if T is infinite). The predictions of the adaptive mixture are almost as good as those of any suffix tree transducer with any set of parameters. 
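The size-indexed update above can be sketched as follows. This is an illustrative sketch under the stated uniform-prior assumption (the state dictionary and function names are inventions for the example): one weight κ(i) per candidate alphabet size is maintained and updated per observation, giving the claimed O(|Σ_out|) cost instead of enumerating all 2^{|Σ_out|} subsets.

```python
# Sketch of the per-size weights kappa(i) and the mixture prediction of
# Equ. (9).  kappa_0(i) = 1/|Sigma_out| for every size i.
from collections import Counter

def make_state(alphabet_size):
    return {"kappa": {i: 1.0 / alphabet_size
                      for i in range(1, alphabet_size + 1)},
            "counts": Counter(), "seen": set(), "n": 0, "size": alphabet_size}

def predict_symbol(st, y):
    """Return the mixture probability of y, then update the per-size weights."""
    n, seen, total = st["n"], st["seen"], st["size"]
    new_kappa = {}
    for i, k in st["kappa"].items():
        if len(seen | {y}) > i:
            new_kappa[i] = 0.0                    # size i is ruled out
        elif y in seen:
            new_kappa[i] = k * (st["counts"][y] + 0.5) / (n + i / 2)
        else:                                     # an entirely new symbol
            new_kappa[i] = k * ((i - len(seen)) / (total - len(seen))
                                * 0.5 / (n + i / 2))
    prob = sum(new_kappa.values()) / sum(st["kappa"].values())   # Equ. (9)
    st["kappa"], st["n"] = new_kappa, n + 1
    st["counts"][y] += 1
    seen.add(y)
    return prob
```

For a two-symbol alphabet the very first prediction is 1/2 for either symbol, and once two distinct symbols have been seen the weight of size-1 alphabets drops to zero, exactly as the "Informally" paragraph describes.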
The logarithmic loss of the mixture depends on the number of non-zero parameters as follows, \n\nLoss_n^mix ≤ Loss_n(T') - log(P_0(T')) + 1/2 N_z log(n) + O(|T'| |Σ_out|), \n\nwhere N_z is the number of non-zero parameters of the transducer T'. If N_z ≪ |T'| |Σ_out| then the performance of the above scheme, when employing a mixture model for the parameters as well, is significantly better than using the add-1/2 rule with the full alphabet. \n\n5 Evaluation and Applications \n\nIn this section we briefly present evaluation results of the model and its learning algorithm. We also discuss and present results obtained from learning the syntactic structure of noun phrases. We start with an evaluation of the estimation scheme for a multinomial source. \n\nIn order to check the convergence of a mixture model for a multinomial source, we simulated a source whose output symbols belong to an alphabet of size 10 and set the probabilities of observing any of the last five symbols to zero. Therefore, the actual alphabet is of size 5. The posterior probabilities for the sum of all possible subsets of Σ_out of size i (1 ≤ i ≤ 10) were calculated after each iteration. The results are plotted on the left part of Fig. 2. The very first observations rule out alphabets of size lower than 5 by setting their posterior probability to zero. After a few observations, the posterior probability is concentrated around the actual size, yielding an accurate online estimate of the multinomial source. \n\nThe simplicity of the learning algorithm and the online update scheme enable evaluation of the algorithm on millions of input-output pairs in a few minutes. For example, the average update time for a suffix tree transducer of maximal depth 10, when the output alphabet is of size 4, is about 0.2 milliseconds on a Silicon Graphics workstation. A typical result is shown in Fig. 2 on the right. In the example, Σ_out = Σ_in = {1, 2, 3, 4}. 
The description of the source is as follows. If x_n ≥ 3 then y_n is uniformly distributed over Σ_out; otherwise (x_n ≤ 2), y_n = x_{n-5} with probability 0.9 and y_n = 4 - x_{n-5} with probability 0.1. The input sequence x_1, x_2, ... was created entirely at random. This source can be implemented by a sparse suffix tree transducer of maximal depth 5. Note that the actual size of the alphabet is only 2 at half of the leaves of the tree. We used a suffix tree transducer of maximal depth 20 to learn the source. The negative of the logarithm of the predictions (normalized per symbol) are shown for (a) the true source, (b) a mixture of suffix tree transducers and their parameters, (c) a mixture of only the possible suffix tree transducers (the parameters are estimated using the add-1/2 scheme), and (d) a single (overestimated) model of depth 8. Clearly, the mixture models converge to the entropy of the source much faster than the single model. Moreover, employing the mixture estimation technique twice results in an even faster convergence. \n\n[Figure 2 appears here; its right-panel legend lists (a) Source, (b) Mixture of Models and Parameters, (c) Mixture of Models, (d) Single Overestimated Model, with the x-axis labeled "Number of Examples" (50-500).] \n\nFigure 2: Left: Example of the convergence of the posterior probability of a mixture model for a multinomial source with a large number of possible outcomes when the actual number of observed symbols is small. Right: performance comparison of the predictions of a single model, two mixture models and the true underlying transducer. \n\nWe are currently exploring the applicative possibilities of the algorithm. Here we briefly discuss and demonstrate how to induce an English noun phrase recognizer. Recognizing noun phrases is an important task in automatic natural text processing, for applications such as information retrieval, translation tools and data extraction from texts. A common practice is to recognize noun phrases by first analyzing the text with a part-of-speech tagger, which assigns the appropriate part-of-speech (verb, noun, adjective, etc.) for each word in context. Then, noun phrases are identified by manually defined regular expression patterns that are matched against the part-of-speech sequences. We took an alternative route by building a suffix tree transducer based on a labeled data set from the UPENN tree-bank corpus. We defined Σ_in to be the set of possible part-of-speech tags and set Σ_out = {0, 1}, where the output symbol given its corresponding input symbol (the part-of-speech tag of the current word) is 1 iff the word is part of a noun phrase. We used over 250,000 marked tags and tested the performance on more than 37,000 tags. The test phase was performed by freezing the model structure, the mixture weights and the estimated parameters. 
The suffix tree transducer was of maximal depth 15; hence very long phrases can be statistically identified. By thresholding the output probability we classified the tags in the test data and found that less than 2.4% of the words were misclassified. A typical result is given in Table 1. We are currently investigating methods to incorporate linguistic knowledge into the model and its learning algorithm and to compare the performance of the model with traditional techniques. \n\n[Table 1 appears here: an example sentence with, for each word, its part-of-speech tag, its true noun-phrase class (0/1), and the model's predicted probability; the column alignment of the original table was lost in extraction.] \n\nTable 1: Extraction of noun phrases using a suffix tree transducer. In this typical example, two long noun phrases were identified correctly with high confidence. \n\nAcknowledgments \n\nThanks to Y. Bengio, Y. Freund, F. Pereira, D. Ron, R. Schapire, and N. Tishby for helpful discussions. The work on syntactic structure induction is done in collaboration with I. Dagan and S. Engelson. This work was done while the author was at the Hebrew University of Jerusalem. \n\nReferences \n\n[1] Y. Bengio and P. Frasconi. An input output HMM architecture. In NIPS-7, 1994. \n[2] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. In STOC-24, 1993. \n[3] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991. \n[4] A. DeSantis, G. Markowsky, and M.N. Wegman. Learning probabilistic prediction functions. In Proc. of the 1st Workshop on Computational Learning Theory, pages 312-328, 1988. 
\n\n[5] C.L. Giles. C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen. and Y.C. Lee. Learning and extracting \nfinite state automata with second-orderrecurrent neural networks. Neural Computation. 4:393-\n405.1992. \n\n[6] D. Haussler and A. Barron. How well do Bayes methods work for on-line prediction of {+ 1, -1 } \n\nvalues? In The3rdNEC Symp . on Comput. andCogn., 1993. \n\n[7] D.P. HeImbold and R.E. Schapire. Predicting nearly as well as the best pruning of a decision \n\ntree. In COLT-8. 1995. \n\n[8] R.A. Jacobs, M.1. Jordan. SJ. NOWlan, and G.E. Hinton. Adaptive mixture of local experts. \n\nNeural Computation, 3:79-87. 1991. \n\n[9] R.E. Krichevsky and V.K. Trofimov. The performance of universal encoding. IEEE Trans. on \n\nInform. Theory. 1981. \n\n[10] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and \n\nComputation, 108:212-261,1994. \n\n[11] M.D. Riley. A statistical model for generating pronounication networks. In Proc. of IEEE Con/. \n\non Acoustics. Speech and Signal Processing. pages 737-740.1991. \n\n[12] D. Ron. Y. Singer, and N. Tishby. The power of amnesia. In NIPS-6. 1993. \n[13] F.MJ. Willems. Y.M. Shtarkov. and TJ. Tjalkens. The context tree weighting method: Basic \n\nproperties. IEEE Trans. Inform. Theory. 41(3):653-664.1995. \n\n\f", "award": [], "sourceid": 1099, "authors": [{"given_name": "Yoram", "family_name": "Singer", "institution": null}]}