{"title": "Using Random Forests in the Structured Language Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1545, "page_last": 1552, "abstract": null, "full_text": "Using Random Forests in the Structured Language Model

Peng Xu and Frederick Jelinek
Center for Language and Speech Processing
Department of Electrical and Computer Engineering
The Johns Hopkins University
{xp,jelinek}@jhu.edu

Abstract

In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information, and to investigate the performance of the resulting SLM in automatic speech recognition.
RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and on a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling.
RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing.

1 Introduction

In many systems dealing with speech or natural language, such as Automatic Speech Recognition and Statistical Machine Translation, a language model is a crucial component for searching in the often prohibitively large hypothesis space.
Most state-of-the-art systems use n-gram language models, which are simple and effective most of the time. Many smoothing techniques that improve language model probability estimation have been proposed and studied in the n-gram literature [1]. There has also been work exploring Decision Tree (DT) language models [2, 3], which attempt to cluster similar histories together to achieve better probability estimation. However, the results were negative [3]: decision tree language models failed to improve upon baseline n-gram models of the same order n.

Random Forest (RF) language models, which are generalizations of DT language models, have recently been applied to word n-grams [4]. DT growing is randomized in order to construct RFs efficiently. Once constructed, the RFs function as a randomized history clustering, which helps in dealing with the data sparseness problem. In general, the weakness of some trees can be compensated for by other trees. The collective contribution of all DTs in an RF n-gram model results in significant improvements in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system.

Language models can also be improved with better representations of the history. Recent efforts have studied various ways of using information from a longer history span than that usually captured by normal n-gram language models, as well as ways of using syntactic information that is not available to word-based n-gram models [5, 6, 7]. All these language models are based on stochastic parsing techniques that build up parse trees for the input word sequence and condition the generation of words on syntactic and lexical information available in the parse trees. Since these language models capture useful hierarchical characteristics of language, they can improve PPL and WER significantly for various tasks. However, due to the n-gram nature of the components of the syntactic language models, the data sparseness problem can be severe.

In order to reduce the data sparseness problem when using rich syntactic information in the context, we study the use of RFs in the structured language model (SLM) [5]. Our results show that although the components of the SLM use high order n-grams, our RF approach can still achieve better performance, reducing both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system compared to a Kneser-Ney smoothing baseline.

2 Basic Language Modeling

The purpose of a language model is to estimate the probability of a word string. Let W denote a string of N words, W = w_1, w_2, \ldots, w_N. Then, by the chain rule of probability, we have

P(W) = P(w_1) \prod_{i=2}^{N} P(w_i | w_1, \ldots, w_{i-1}).   (1)

In order to estimate the probabilities P(w_i | w_1, \ldots, w_{i-1}), we need a training corpus consisting of a large number of words. However, in any practical natural language system of even moderate vocabulary size, it is clear that the number of probabilities to be estimated and stored is prohibitively large. Therefore, histories w_1, \ldots, w_{i-1} for a word w_i are usually grouped into equivalence classes. The most widely used language models, n-gram language models, use the identities of the last n-1 words as equivalence classes.
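The chain-rule factorization and the n-gram equivalence classing above can be sketched with simple counts. A minimal illustration (the toy corpus is made up for this sketch; real counts come from a large training set):

```python
from collections import defaultdict

# Hypothetical toy corpus, purely for illustration.
corpus = "the cat sat on the mat the cat ran".split()

# Collect trigram and bigram counts, treating the last n-1 = 2 words
# as the equivalence class of the full history.
tri, bi = defaultdict(int), defaultdict(int)
for i in range(2, len(corpus)):
    tri[(corpus[i - 2], corpus[i - 1], corpus[i])] += 1
    bi[(corpus[i - 2], corpus[i - 1])] += 1

def p_ml(w, h2, h1):
    """Maximum-likelihood trigram estimate C(h2 h1 w) / C(h2 h1)."""
    if bi[(h2, h1)] == 0:
        return 0.0
    return tri[(h2, h1, w)] / bi[(h2, h1)]

# "the cat" occurs twice, followed once by "sat" and once by "ran".
print(p_ml("sat", "the", "cat"))  # 0.5
```

As the next subsection argues, these raw ML estimates assign zero probability to any n-gram unseen in training, which is what smoothing must repair.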
In an n-gram model, we then have

P(W) = P(w_1) \prod_{i=2}^{N} P(w_i | w_{i-n+1}^{i-1}),   (2)

where we have used w_{i-n+1}^{i-1} to denote the word sequence w_{i-n+1}, \ldots, w_{i-1}. If we could handle unlimited amounts of training data, the maximum likelihood (ML) estimate of P(w_i | w_{i-n+1}^{i-1}) would be the best:

P(w_i | w_{i-n+1}^{i-1}) = C(w_{i-n+1}^{i}) / C(w_{i-n+1}^{i-1}),   (3)

where C(w_{i-n+1}^{i}) is the number of times the string w_{i-n+1}, \ldots, w_i is seen in the training data.

2.1 Language Model Smoothing

An n-gram model with n = 3 is called a trigram model. For a vocabulary of size |V| = 10^4, there are |V|^3 = 10^{12} trigram probabilities to be estimated. For any training data of a manageable size, many of the probabilities will be zero if the ML estimate is used.

In order to solve this problem, many smoothing techniques have been studied (see [1] and the references therein). Smoothing adjusts the ML estimates to produce more accurate probabilities and to assign nonzero probabilities to every word string. Details about the various smoothing techniques will not be presented in this paper, but we will outline one particular method, interpolated Kneser-Ney smoothing [8], for later reference.

Interpolated Kneser-Ney smoothing assumes the following form:

P_{KN}(w_i | w_{i-n+1}^{i-1}) = max(C(w_{i-n+1}^{i}) - D, 0) / C(w_{i-n+1}^{i-1}) + \lambda(w_{i-n+1}^{i-1}) P_{KN}(w_i | w_{i-n+2}^{i-1}),   (4)

where D is a discounting constant and \lambda(w_{i-n+1}^{i-1}) is the interpolation weight for the lower order ((n-1)-gram) probabilities. The discount constant is often estimated using the leave-one-out method, leading to the approximation D = n_1 / (n_1 + 2 n_2), where n_1 is the number of n-grams with count one and n_2 is the number of n-grams with count two.
To ensure that the probabilities sum to one, we have

\lambda(w_{i-n+1}^{i-1}) = (D / C(w_{i-n+1}^{i-1})) \cdot |{ w_i : C(w_{i-n+1}^{i}) > 0 }|.

The lower order probabilities in interpolated Kneser-Ney smoothing can be estimated as (assuming ML estimation):

P_{KN}(w_i | w_{i-n+2}^{i-1}) = |{ w_{i-n+1} : C(w_{i-n+1}^{i}) > 0 }| / |{ (w_{i-n+1}, w_i) : C(w_{i-n+1}^{i}) > 0 }|.   (5)

Note that the lower order probabilities are usually themselves recursively smoothed using Equation 4.

2.2 Language Model Evaluation

A commonly used task-independent quality measure for a given language model is related to the cross-entropy of the underlying model and is referred to as perplexity (PPL):

PPL = exp( -(1/N) \sum_{i=1}^{N} \log P(w_i | w_1^{i-1}) ),   (6)

where w_1, \ldots, w_N is the test text consisting of N words.
For different tasks, there are different task-dependent quality measures of language models. For example, in an automatic speech recognition system, performance is usually measured by word error rate (WER).

3 The Structured Language Model (SLM)

The SLM uses rich syntactic information beyond regular word n-grams to improve language model quality. An extensive presentation of the SLM can be found in Chelba and Jelinek, 2000 [5]. The model assigns a probability P(W, T) to every sentence W and every possible binary parse T. The terminals of T are the words of W with POS tags, and the nodes of T are annotated with phrase headwords and non-terminal labels. Let W be a sentence of length n words to which we have prepended the sentence beginning marker <s> and appended the sentence end marker </s>, so that w_0 = <s> and w_{n+1} = </s>.

[Figure 1: A word-parse k-prefix. The exposed heads h_{-m} = (<s>, SB), \ldots, h_{-1}, h_0 = (h_0.word, h_0.tag) sit above the tagged words (w_p, t_p), (w_{p+1}, t_{p+1}), \ldots, (w_k, t_k), followed by the not-yet-parsed words w_{k+1}, \ldots]

Let W_k = w_0 \ldots w_k be the word k-prefix of the sentence, the words from the beginning of the sentence up to the current position k, and W_k T_k the word-parse k-prefix. Figure 1 shows a word-parse k-prefix; h_0, \ldots, h_{-m} are the exposed heads, each head being a pair (headword, non-terminal label), or (word, POS tag) in the case of a root-only tree. The exposed heads at a given position k in the input sentence are a function of the word-parse k-prefix [5].

The joint probability P(W, T) of a word sequence W and a complete parse T comes from the contributions of three components: WORD-PREDICTOR, TAGGER and CONSTRUCTOR. The SLM works in the following way: first, the WORD-PREDICTOR predicts a word based on the word-parse prefix; the TAGGER then assigns a POS tag to the predicted word based on the word itself and the word-parse prefix; the CONSTRUCTOR takes a series of actions, each of which turns a parse prefix into a new parse prefix (the series of actions ends with a NULL action, which tells the WORD-PREDICTOR to predict the next word). Details about the three components can be found in [5]. Each of the three components can be seen as an n-gram model and can be estimated independently because of the product form of the joint probability. They are parameterized (approximated) as follows:

P(w_k | W_{k-1} T_{k-1}) = P(w_k | h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word),   (7)

P(t_k | w_k, W_{k-1} T_{k-1}) = P(t_k | w_k, h_0.tag, h_{-1}.tag),   (8)

P(p_i^k | W_{k-1} T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) = P(p_i^k | h_0.tag, h_{-1}.tag, h_{-2}.tag, h_0.word, h_{-1}.word),   (9)

where p_i^k is the i-th CONSTRUCTOR action after the k-th word and POS tag have been predicted. Since the number of parses for a given word prefix W_k grows exponentially with k, |{T_k}| ~ O(2^k), the state space of our model is huge even for relatively short sentences. Thus we must use a search strategy that prunes the allowable parse set. One choice is a synchronous multi-stack search algorithm [5], which is very similar to a beam search.

The language model probability assignment for the word at position k+1 in the input sentence is made using:

P_{SLM}(w_{k+1} | W_k) = \sum_{T_k \in S_k} P(w_{k+1} | W_k T_k) \cdot \rho(W_k, T_k),
\rho(W_k, T_k) = P(W_k T_k) / \sum_{T_k \in S_k} P(W_k T_k),   (10)

which ensures a proper probability normalization over strings of words, where S_k is the set of all parses present in the stacks at the current stage k and P(W_k T_k) is the joint probability of the word-parse prefix W_k T_k.
Each model component (WORD-PREDICTOR, TAGGER, CONSTRUCTOR) is estimated independently from a set of parsed sentences, after the sentences undergo headword percolation and binarization (see details in [5]).

4 Using Random Forests in the Structured Language Model

4.1 Random Forest n-gram Modeling

A Random Forest (RF) n-gram model is a collection of randomly constructed decision tree (DT) n-gram models. Unlike RFs in classification and regression tasks [9, 10, 11], RFs are used in language modeling to deal with the data sparseness problem [4]. Therefore, the training data is not randomly sampled for each DT. Figure 2 shows the algorithms DT-Grow and Node-Split used for generating random DT language models.

We define a position in the history as the distance between a word in the history and the predicted word.
The randomization is carried out in two places: a random selection of positions in the history and an initial random split of the basic elements. Since our splitting criterion is to maximize the log-likelihood of the training data, each split uses only statistics (from the training data) associated with the node under consideration. Smoothing is not needed during splitting, and we can use a fast exchange algorithm [12] in Node-Split. Given a position i in the history, \beta(v) is defined to be the set of histories belonging to the node p such that they all have word v at position i. It is clear that for every position i in the history, the union \cup_v \beta(v) is the set of all histories in the node p.

Algorithm DT-Grow
Input: counts for training and heldout data
Initialize: create a root node containing all histories in the training data and put it in the set \Phi.
While \Phi is not empty:
  1. Get a node p from \Phi.
  2. If Node-Split(p) is successful, eliminate p from \Phi and put the two children of p in \Phi.
Foreach internal node p in the tree:
  1. L^H_p := normalized likelihood of the heldout data associated with p, using training data statistics in p.
  2. Get the set of leaves P rooted in p.
  3. L^H_P := normalized likelihood of the heldout data associated with all leaves in P, using training data statistics in the corresponding leaves.
  4. If L^H_P - L^H_p < 0, prune the subtree rooted in p.
Output: a Decision Tree language model

Algorithm Node-Split(p)
Input: node p and the training data associated with it
Initialize: randomly select a subset of positions I in the history.
Foreach position i in I:
  1. Group all histories into basic elements \beta(v).
  2. Randomly split the elements \beta(v) into sets L and R.
  3. While there are elements moved, do:
     (a) Move each element from L to R if the move results in a positive gain in training data likelihood.
     (b) Move each element from R to L if the move results in a positive gain in training data likelihood.
Select the position from I that results in the largest gain.
Output: a split L and R, or failure if the largest gain is not positive

Figure 2: The algorithms DT-Grow and Node-Split

In DT-Grow, after a DT is fully grown, we use some heldout data to prune it. Pruning is done in such a way that we maximize the likelihood of the heldout data, where smoothing is applied according to Equation 4:

P_{DT}(w_i | \Phi_{DT}(w_{i-n+1}^{i-1})) = max(C(w_i, \Phi_{DT}(w_{i-n+1}^{i-1})) - D, 0) / C(\Phi_{DT}(w_{i-n+1}^{i-1})) + \lambda(\Phi_{DT}(w_{i-n+1}^{i-1})) P_{KN}(w_i | w_{i-n+2}^{i-1}),   (11)

where \Phi_{DT}(\cdot) is one of the DT nodes the history can be mapped to and P_{KN}(w_i | w_{i-n+2}^{i-1}) is as defined in Equation 5. This pruning is similar to the pruning strategy used in CART [13].

Once we have the DTs, we use only the leaf nodes as equivalence classes of histories. If a new history is encountered, it is very likely that we will not be able to place it at a leaf node in the DT. In this case, we set \lambda(\Phi_{DT}(w_{i-n+1}^{i-1})) = 1 in Equation 11 and simply use P_{KN}(w_i | w_{i-n+2}^{i-1}) to get the probabilities.
The randomized version of the DT growing algorithm can be run many times, finally giving us a collection of randomly grown DTs: a Random Forest (RF). Since each DT is a smoothed language model, we simply aggregate all DTs in our RF to get the RF language model. Suppose we have M randomly grown DTs, DT_1, \ldots, DT_M. In the n-gram case, the RF language model probabilities can be computed as:

P_{RF}(w_i | w_{i-n+1}^{i-1}) = (1/M) \sum_{j=1}^{M} P_{DT_j}(w_i | \Phi_{DT_j}(w_{i-n+1}^{i-1})),   (12)

where \Phi_{DT_j}(w_{i-n+1}^{i-1}) maps the history w_{i-n+1}^{i-1} to a leaf node in DT_j. If w_{i-n+1}^{i-1} cannot be mapped to a leaf node in some DT, we back off to the lower order KN probability as mentioned earlier.

It can be shown by the Law of Large Numbers that the probability in Equation 12 converges as the number of DTs grows. It converges to E_T[ P_T(w_i | \Phi_T(w_{i-n+1}^{i-1})) ], where T is a random variable representing the random DTs. The advantage of the RF approach over KN smoothing lies in the fact that different DTs have different weaknesses and strengths for word prediction. As the number of trees grows, the weakness of some trees can be compensated for by other trees. This advantage and the convergence have been shown experimentally in [4].

4.2 Using RFs in the SLM

Since the three model components of the SLM in Equations 7-9 can be estimated independently, we can construct an RF for each component using the algorithm DT-Grow from the previous section. The only difference is that we will have different n-gram orders and different items in the history for each model.

Ideally, we would like to use RFs for every component of the SLM. However, due to the nature of the SLM, there are difficulties. The SLM uses a synchronous multi-stack search algorithm to dynamically construct stacks and compute the language model probabilities as in Equation 10. If we use RFs for all components, we need to load all DTs in the RFs into memory at runtime. This is impractical for RFs of any reasonable size.

There is a different approach that can take advantage of the randomness in the RFs. Suppose we have M randomly grown DTs, DT^a_1, \ldots, DT^a_M, for each component a of the SLM, where a \in {P, T, C} for WORD-PREDICTOR, TAGGER and CONSTRUCTOR, respectively. The DTs are grouped into M triples {DT^P_j, DT^T_j, DT^C_j}, j = 1, \ldots, M.
We calculate the joint probability P(W, T) for the j-th DT triple according to:

P_j(W, T) = \prod_{k=1}^{n+1} [ P_{DT^P_j}(w_k | W_{k-1} T_{k-1}) \cdot P_{DT^T_j}(t_k | W_{k-1} T_{k-1}, w_k) \cdot \prod_{i=1}^{N_k} P_{DT^C_j}(p_i^k | W_{k-1} T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) ].   (13)

Then, the language model probability assignment for the j-th DT triple is made using:

P_j(w_{k+1} | W_k) = \sum_{T_k^j \in S_k^j} P_{DT^P_j}(w_{k+1} | W_k T_k^j) \cdot \rho_j(W_k, T_k^j),
\rho_j(W_k, T_k^j) = P_j(W_k T_k^j) / \sum_{T_k^j \in S_k^j} P_j(W_k T_k^j),   (14)

which is achieved by running the synchronous multi-stack algorithm using the j-th DT triple as the model. Finally, after the SLM is run M times, the RF language model probability is the average of the probabilities above:

P_{RF}(w_{k+1} | W_k) = (1/M) \sum_{j=1}^{M} P_j(w_{k+1} | W_k).   (15)

The triple {DT^P_j, DT^T_j, DT^C_j} can be considered a single DT in which the root node has three children corresponding to the three root nodes of DT^P_j, DT^T_j and DT^C_j. The root node of this DT asks the question: which model component does the history belong to? According to the answer, we proceed to one of the three children (one of the three components, in fact). Since the multi-stack search algorithm is deterministic given the DT, the probability in Equation 15 can be shown to converge.

5 Experiments

5.1 Perplexity (PPL)

We used the UPenn Treebank portion of the WSJ corpus to carry out our experiments. The UPenn Treebank contains 24 sections of hand-parsed sentences, for a total of about one million words. We used sections 00-20 for training our models, sections 21-22 as heldout data for pruning the DTs, and sections 23-24 to test our models.
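The final averaging of Equation 15, and the perplexity of Equation 6 computed on the averaged stream, can be sketched as follows; the per-run probability streams below are made-up numbers standing in for the output of M runs of the multi-stack search:

```python
import math

def rf_probability(per_run_probs):
    """Equation 15: the RF probability of a word is the average of the
    probabilities assigned by the M independently run DT triples."""
    return sum(per_run_probs) / len(per_run_probs)

def perplexity(word_probs):
    """Equation 6: PPL = exp(-(1/N) * sum_i log P(w_i | history))."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Hypothetical probability streams from M = 2 runs over a 3-word test text.
run1 = [0.20, 0.10, 0.05]
run2 = [0.10, 0.30, 0.05]
avg = [rf_probability(ps) for ps in zip(run1, run2)]  # [0.15, 0.20, 0.05]
print(perplexity(avg))  # ≈ 8.74
```

Note that the probabilities are averaged per word position before taking logs, matching Equation 15; averaging log-probabilities instead would compute a different (geometric-mean) model.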
Before carrying out our experiments, we normalized the text in the following ways: numbers in Arabic form were replaced by a single token “N”, punctuation was removed, all words were mapped to lower case, extra information in the parse trees was ignored, and, finally, traces were ignored. The word vocabulary contains 10k words, including a special token for unknown words. There are 40 items in the part-of-speech tag set and 54 items in the non-terminal set.

The three components of the SLM were treated independently during training. We trained an RF for each component, and each RF contained 100 randomly grown DTs. The baseline SLM used KN smoothing (KN-SLM). The 100 probability sequences from the 100 triples were aggregated to get the final PPL. The results are shown in Table 1. We also interpolated the SLM with the KN-trigram to get further improvements. The interpolation weight α in Table 1 is on the KN-trigram. The RF-SLM achieved a 10.9% and a 7.5% improvement over the KN-SLM, before and after interpolation with the KN-trigram, respectively. Compared to the improvement reported in [4] (10.5% of the RF-trigram over the KN-trigram), the RF-SLM achieved a greater improvement by using syntactic information. Figure 3 shows the convergence of the PPL as the number of DTs grows from 1 to 100.

Model  | α=0.0 | α=0.4 | α=1.0
KN-SLM | 137.9 | 127.2 | 145.0
RF-SLM | 122.8 | 117.6 | 145.0
Gain   | 10.9% | 7.5%  | -

Table 1: PPL comparison between KN-SLM and RF-SLM, interpolated with the KN-trigram

[Figure 3: PPL convergence. The PPL (vertical axis, roughly 120 to 160) decreases and flattens as the number of DTs grows from 1 to 100.]

5.2 Word Error Rate by N-best Re-scoring

To test our RF modeling approach in the context of speech recognition, we evaluated the models in the WSJ DARPA '93 HUB1 test setup. The size of the test set is 213 utterances, 3446 words.
The 20k-word open vocabulary and baseline 3-gram model are the standard ones provided by NIST and LDC (see [5] for details). The N-best lists were generated using the standard 3-gram model trained on 40M words of WSJ. The N-best size was at most 50 for each utterance, and the average size was about 23. For the KN-SLM and RF-SLM, we used 20M words automatically parsed, binarized and enriched with headword and NT/POS tag information. Because the full RF-SLM becomes very large, we used an RF only for the WORD-PREDICTOR component (RF-SLM-P); the other two components used KN smoothing. The results are reported in Table 2.

Model    | α=0.0 | α=0.2 | α=0.4 | α=0.6 | α=0.8
KN-SLM   | 12.8  | 12.7  | 12.5  | 12.7  | 12.6
RF-SLM-P | 11.9  | 12.6  | 12.2  | 12.3  | 12.3

Table 2: N-best rescoring WER results

For purposes of comparison, we interpolated all models with the KN-trigram built from 40M words at different interpolation weights α (on the KN-trigram). However, it is the α = 0.0 column that is the most interesting: the RF approach improved over the regular KN approach with an absolute WER reduction of 0.9%.

6 Conclusions

Based on the idea of Random Forests in classification and regression, we developed algorithms for constructing and using Random Forests in language modeling. In particular, we applied this new probability estimation technique to the Structured Language Model, in which there are three model components that can be estimated independently. The independently constructed Random Forests can be considered a single, more general Random Forest, which ensures the convergence of the probabilities as the number of Decision Trees grows. The results on a large vocabulary speech recognition system show that we can achieve significant reductions in both perplexity and word error rate, compared to a baseline using Kneser-Ney smoothing.

References

[1] Stanley F.
Chen and Joshua Goodman, “An empirical study of smoothing techniques for language modeling,” Tech. Rep. TR-10-98, Computer Science Group, Harvard University, Cambridge, Massachusetts, 1998.

[2] L. Bahl, P. Brown, P. de Souza, and R. Mercer, “A tree-based statistical language model for natural language speech recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, pp. 1001–1008, July 1989.

[3] Gerasimos Potamianos and Frederick Jelinek, “A study of n-gram and decision tree letter language modeling methods,” Speech Communication, vol. 24, no. 3, pp. 171–192, 1998.

[4] Peng Xu and Frederick Jelinek, “Random forests in language modeling,” in Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July 2004.

[5] Ciprian Chelba and Frederick Jelinek, “Structured language modeling,” Computer Speech and Language, vol. 14, no. 4, pp. 283–332, October 2000.

[6] Eugene Charniak, “Immediate-head parsing for language models,” in Proceedings of the 39th Annual Meeting and 10th Conference of the European Chapter of the ACL, Toulouse, France, July 2001, pp. 116–123.

[7] Brian Roark, Robust Probabilistic Predictive Syntactic Processing: Motivations, Models and Applications, Ph.D. thesis, Brown University, Providence, RI, 2001.

[8] Reinhard Kneser and Hermann Ney, “Improved backing-off for m-gram language modeling,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995, vol. 1, pp. 181–184.

[9] Y. Amit and D. Geman, “Shape quantization and recognition with randomized trees,” Neural Computation, vol. 9, pp. 1545–1588, 1997.

[10] Leo Breiman, “Random forests,” Tech. Rep., Statistics Department, University of California, Berkeley, Berkeley, CA, 2001.

[11] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[12] S. Martin, J. Liermann, and H. Ney, “Algorithms for bigram and trigram word clustering,” Speech Communication, vol. 24, no. 3, pp. 171–192, 1998.

[13] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Chapman and Hall, New York, 1984.
", "award": [], "sourceid": 2667, "authors": [{"given_name": "Peng", "family_name": "Xu", "institution": null}, {"given_name": "Frederick", "family_name": "Jelinek", "institution": null}]}