{"title": "Application of Variational Bayesian Approach to Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1261, "page_last": 1268, "abstract": "", "full_text": "Application of Variational Bayesian Approach to\n\nSpeech Recognition\n\nShinji Watanabe, Yasuhiro Minami, Atsushi Nakamura and Naonori Ueda\n\nNTT Communication Science Laboratories, NTT Corporation\n\n2-4, Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan\n\nfwatanabe,minami,ats,uedag@cslab.kecl.ntt.co.jp\n\nAbstract\n\nIn this paper, we propose a Bayesian framework, which constructs\nshared-state triphone HMMs based on a variational Bayesian approach,\nand recognizes speech based on the Bayesian prediction classi(cid:2)cation;\nvariational Bayesian estimation and clustering for speech recognition\n(VBEC). An appropriate model structure with high recognition perfor-\nmance can be found within a VBEC framework. Unlike conventional\nmethods, including BIC or MDL criterion based on the maximum likeli-\nhood approach, the proposed model selection is valid in principle, even\nwhen there are insuf(cid:2)cient amounts of data, because it does not use\nan asymptotic assumption.\nIn isolated word recognition experiments,\nwe show the advantage of VBEC over conventional methods, especially\nwhen dealing with small amounts of data.\n\n1 Introduction\n\nA statistical modeling of spectral features of speech (acoustic modeling) is one of the most\ncrucial parts in the speech recognition.\nIn acoustic modeling, a triphone-based hidden\nMarkov model (triphone HMM) has been widely employed. The triphone is a context\ndependent phoneme unit that considers both the preceding and following phonemes. Al-\nthough the triphone enables the precise modeling of spectral features, the total number of\ntriphones is too large to prepare suf(cid:2)cient amounts of training data for each triphone. 
In order to deal with this data insufficiency, an HMM state is usually shared among multiple triphone HMMs, which increases the amount of training data per state. Such shared-state triphone HMMs (SST-HMMs) can be constructed by successively clustering states based on the phonetic decision tree method [4] [7]. The important practical problem that must be solved when constructing SST-HMMs is how to adapt the total number of shared states to the amount of available training data. Namely, maintaining the balance between model complexity and training data size is crucial for high generalization performance.\n\nThe maximum likelihood (ML) is inappropriate as a model selection criterion, since the likelihood increases monotonically as the number of states increases. Some heuristic thresholding is therefore necessary to terminate the partitioning. To solve this problem, the Bayesian information criterion (BIC) and the minimum description length (MDL) criterion have been employed to determine the tree structure of SST-HMMs [2] [5].^1 However, since the BIC/MDL is based on an asymptotic assumption, it is invalid in principle when the amount of training data is small, because the assumption fails.\n\nIn this paper, we present a practical method within the Bayesian framework for estimating posterior distributions over parameters, selecting an appropriate model structure of SST-HMMs (clustering triphone HMM states) based on a variational Bayesian (VB) approach, and recognizing speech based on Bayesian prediction classification: variational Bayesian estimation and clustering for speech recognition (VBEC). Unlike the BIC/MDL, VB does not assume asymptotic normality, and it is therefore applicable in principle even when the data are insufficient. The VB approach has been successfully applied to model selection problems, but mainly for relatively simple mixture models [1] [3] [6] [8]. 
Here, we apply VB to SST-HMMs, which have a more complex model structure than mixture models, and evaluate the effectiveness through a large-scale real speech recognition experiment.\n\n2 Variational Bayesian framework\n\nFirst, we briefly review the VB framework. Let O be a given data set. In the Bayesian approach we are interested in posterior distributions over model parameters, p(Θ|O, m), and over the model structure, p(m|O). Here, Θ is a set of model parameters and m is an index of the model structure. Let us consider a general probabilistic model with latent variables, and let Z be a set of latent variables. Then the model with a fixed model structure m can be defined by the joint distribution p(O, Z|Θ, m).\n\nIn VB, variational posteriors q(Θ|O, m), q(Z|O, m), and q(m|O) are introduced to approximate the true corresponding posteriors. The optimal variational posteriors over Θ and Z, and the model structure that maximizes the optimal q(m|O), can be obtained by maximizing the following objective function:\n\nF_m[q] = ⟨ log [ p(O, Z|Θ, m) p(Θ|m) / ( q(Z|O, m) q(Θ|O, m) ) ] ⟩_{q(Z|O,m), q(Θ|O,m)},   (1)\n\nw.r.t. q(Θ|O, m), q(Z|O, m), and m. Here ⟨f(x)⟩_{p(x)} denotes the expectation of f(x) w.r.t. p(x), and p(Θ|m) is a prior distribution. This optimization can be performed effectively by an EM-like iterative algorithm (see [1] for the details).\n\n3 Applying a VB approach to acoustic models\n\n3.1 Output distributions and prior distributions\n\nWe apply a VB approach to a left-to-right HMM, which has been widely used to represent a phoneme unit in acoustic models for speech recognition, as shown in Figure 1. Let O = {O_t ∈ R^D : t = 1, ..., T} be a sequential data set for a phoneme unit. 
The output distribution in an HMM is given by\n\np(O, S, V|Θ, m) = ∏_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t v_t} b_{s_t v_t}(O_t),   (2)\n\nwhere S is a set of sequences of hidden states, V is a set of sequences of Gaussian mixture components, and s_t and v_t denote the state and mixture component at time t. S and V are sets of discrete latent variables that correspond to the Z mentioned above. a_ij denotes the state transition probability from state i to state j, and c_jk is the k-th weight factor of the Gaussian mixture for state j. b_jk(O_t) = N(O_t | μ_jk, Σ_jk) denotes the Gaussian distribution with mean vector μ_jk and covariance Σ_jk. Θ = {a_ij, c_jk, μ_jk, Σ_jk^{-1} | i, j = 1, ..., J; k = 1, ..., L} is the set of model parameters. J denotes the number of states in an HMM and L denotes the number of Gaussian components in a state. In this paper, we restrict the covariance matrices of the Gaussian distributions to diagonal ones.\n\n^1 These criteria have been independently proposed, but they are practically the same. Therefore, we refer to them hereafter as BIC/MDL.\n\nFigure 1: Hidden Markov model for each phoneme unit. A state is represented by the Gaussian mixture distribution below the state. There are three states and three Gaussian components in this figure.\n\nThe conjugate prior distributions are assumed to be as follows:\n\np(Θ|m) = ∏_{i,j,k} D({a_ij'}_{j'=1}^{J} | φ^0) D({c_jk'}_{k'=1}^{L} | ϕ^0) × N(μ_jk | ν^0_jk, (ξ^0)^{-1} Σ_jk) ∏_{d=1}^{D} G(Σ^{-1}_{jk,d} | η^0, R^0_{jk,d}).   (3)\n\nΦ^0 = {φ^0, ϕ^0, ν^0_jk, ξ^0, η^0, R^0_jk} is a set of hyperparameters. We assume the hyperparameters are constants. 
In Eq. (3), D denotes a Dirichlet distribution and G denotes a gamma distribution.\n\n3.2 Optimal variational posterior distribution ~q(Θ|O, m)\n\nFrom the output distributions and prior distributions in section 3.1, the optimal variational posterior distribution ~q(Θ|O, m) can be obtained as:\n\n~q({a_ij}_{j=1}^{J} | O, m) = D({a_ij}_{j=1}^{J} | {~φ_ij}_{j=1}^{J}),\n~q({c_jk}_{k=1}^{L} | O, m) = D({c_jk}_{k=1}^{L} | {~ϕ_jk}_{k=1}^{L}),\n~q(b_jk | O, m) = N(μ_jk | ~ν_jk, ~ξ_jk^{-1} Σ_jk) ∏_{d=1}^{D} G(Σ^{-1}_{jk,d} | ~η_jk, ~R_{jk,d}).   (4)\n\n~Φ ≡ {~φ, ~ϕ, ~ν_jk, ~ξ, ~η, ~R_jk} is a set of posterior distribution parameters defined as:\n\n~φ_ij = φ^0 + ~γ_ij;  ~ϕ_jk = ϕ^0 + ~ζ_jk;  ~ξ_jk = ξ^0 + ~ζ_jk;  ~ν_jk = ( ξ^0 ν^0_jk + Σ_{t=1}^{T} ~ζ^t_jk O_t ) / ~ξ_jk;\n~η_jk = η^0 + ~ζ_jk;  ~R_{jk,d} = R^0_{jk,d} + ξ^0 (ν^0_{jk,d} − ~ν_{jk,d})^2 + Σ_{t=1}^{T} ~ζ^t_jk (O_{t,d} − ~ν_{jk,d})^2.   (5)\n\n~Φ is composed of ~γ^t_ij ≡ ~q(s_t = i, s_{t+1} = j | O, m), ~γ_ij ≡ Σ_{t=1}^{T} ~γ^t_ij, ~ζ^t_jk ≡ ~q(s_t = j, v_t = k | O, m), and ~ζ_jk ≡ Σ_{t=1}^{T} ~ζ^t_jk. ~γ^t_ij denotes the transition probability from state i to state j at time t, and ~ζ^t_jk denotes the occupation probability of mixture component k in state j at time t.\n\n3.3 Optimal variational posterior distribution ~q(S, V|O, m)\n\nFrom the output distributions and prior distributions in section 3.1, the optimal variational posterior distribution over the latent variables, ~q(S, V|O, m), can be obtained as:\n\n~q(S, V|O, m) ∝ ∏_{t=1}^{T} ~a_{s_{t-1} s_t} ~c_{s_t v_t} ~b_{s_t v_t}(O_t),   (6)\n\nwhere\n\n~a_{s_{t-1} s_t} = exp{ Ψ(~φ_{s_{t-1} s_t}) − Ψ( Σ_{s'=1}^{J} ~φ_{s_{t-1} s'} ) },\n~c_{s_t v_t} = exp{ Ψ(~ϕ_{s_t v_t}) − Ψ( Σ_{v'=1}^{L} ~ϕ_{s_t v'} ) },\n~b_{s_t v_t}(O_t) = exp{ (D/2) ( −log 2π − 1/~ξ_{s_t v_t} + Ψ(~η_{s_t v_t}/2) ) − (1/2) Σ_{d=1}^{D} ( log( ~R_{s_t v_t,d}/2 ) + (O_{t,d} − ~ν_{s_t v_t,d})^2 ~η_{s_t v_t} / ~R_{s_t v_t,d} ) }.   (7)\n\nΨ(y) is the digamma function. From these results, the transition and occupation probabilities ~γ^t_ij and ~ζ^t_jk can be obtained by using either a deterministic assignment via the Viterbi algorithm or a probabilistic assignment via the Forward-Backward algorithm. 
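The expected transition scores in Eq. (7) replace point estimates with digamma-corrected values; because exp Ψ(x) lies below x, the resulting weights sum to less than one, an implicit penalty on sparsely observed transitions. A minimal Python sketch (the posterior counts `phi` are illustrative values, not taken from the paper's experiments):

```python
import math

def digamma(x, h=1e-6):
    # numerical derivative of log-gamma; accurate enough for illustration
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def expected_log_weight(phi, j):
    """E_q[log a_ij] under a Dirichlet posterior with parameters phi,
    i.e. Psi(phi_j) - Psi(sum_j' phi_j'), as in Eq. (7)."""
    return digamma(phi[j]) - digamma(sum(phi))

# hypothetical posterior counts ~phi for the transitions out of one state
phi = [3.5, 2.0, 1.5]
a_tilde = [math.exp(expected_log_weight(phi, j)) for j in range(len(phi))]
# each ~a_ij lies below the posterior mean phi_j / sum(phi),
# so the scores sum to less than one
```

The same construction gives the mixture-weight scores ~c from ~ϕ.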
Thus, ~q(Θ|O, m) and ~q(S, V|O, m) can be calculated iteratively so as to maximize F_m.\n\n4 VB training algorithm for acoustic models\n\nBased on the discussion in section 3, a VB training algorithm for an acoustic model based on an HMM and Gaussian mixture model with a fixed model structure m is as follows:\n----------------------------------------------------------------------\nStep 1) Initialize ~γ^t_ij[τ = 0], ~ζ^t_jk[τ = 0] and set Φ^0.\nStep 2) Compute q(S, V|O, m)[τ + 1] using ~γ^t_ij[τ], ~ζ^t_jk[τ] and Φ^0.\nStep 3) Update ~γ^t_ij[τ + 1] and ~ζ^t_jk[τ + 1] using q(S, V|O, m)[τ + 1] via the Viterbi algorithm or Forward-Backward algorithm.\nStep 4) Compute ~Φ[τ + 1] using ~γ^t_ij[τ + 1], ~ζ^t_jk[τ + 1] and Φ^0.\nStep 5) Compute q(Θ|O, m)[τ + 1] using ~Φ[τ + 1] and calculate F_m[τ + 1] based on q(Θ|O, m)[τ + 1] and q(S, V|O, m)[τ + 1].\nStep 6) If |(F_m[τ + 1] − F_m[τ]) / F_m[τ + 1]| ≤ ε, then stop; otherwise set τ ← τ + 1 and go to Step 2.\n----------------------------------------------------------------------\nτ denotes an iteration count. 
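Steps 1-6 amount to a coordinate-ascent loop on F_m. The skeleton below shows only that control flow; `e_step`, `accumulate`, `m_step`, and `free_energy` are toy stand-ins (a simple contraction with a bounded objective), not the paper's actual Forward-Backward statistics or the updates of Eq. (5):

```python
def vb_train(stats0, hyper0, eps=1e-4, max_iter=100):
    """Control flow of the VB training loop (Steps 1-6).
    The four callbacks below are toy stand-ins for the real updates."""
    stats, F_prev = stats0, None
    for tau in range(max_iter):
        q_latent = e_step(stats, hyper0)       # Step 2: q(S,V|O,m)[tau+1]
        stats = accumulate(q_latent)           # Step 3: new ~gamma, ~zeta
        posterior = m_step(stats, hyper0)      # Step 4: ~Phi via Eq. (5)
        F = free_energy(posterior, q_latent)   # Step 5: F_m[tau+1]
        if F_prev is not None and abs((F - F_prev) / F) <= eps:
            break                              # Step 6: relative change small
        F_prev = F
    return posterior, F, tau

# toy stand-ins: a contraction toward a fixed point, with an objective F
# that increases toward a finite limit (so the stopping rule fires)
def e_step(stats, hyper):    return 0.5 * (stats + hyper)
def accumulate(q_latent):    return q_latent
def m_step(stats, hyper):    return stats
def free_energy(post, q):    return 2.0 - (post - 1.0) ** 2
```

For instance, `vb_train(0.0, 1.0)` stops after a handful of iterations, once the relative change in F falls below eps as in Step 6.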
In our experiments, we employed the Viterbi algorithm in Step 3.\n\n5 Variational Bayesian estimation and clustering for speech recognition\n\nIn the previous section, we described a VB training algorithm for HMMs. Here, we explain VBEC, which constructs an acoustic model based on SST-HMMs and recognizes speech based on Bayesian prediction classification. VBEC consists of three phases: model structure selection, retraining, and recognition. The model structure is determined by triphone-state clustering using the phonetic decision tree method [4] [7]. A phonetic decision tree is a kind of binary tree that has a phonetic “Yes/No” question attached to each node, as shown in Figure 2. Let Ω(n) denote the set of states held by a tree node n. We start with only a root node (n = 0), which holds the set of all the triphone HMM states Ω(0) for an identical center phoneme. The set of triphone states is then split into two sets, Ω(n_Y) and Ω(n_N), which are held by two new nodes, n_Y and n_N, respectively, as shown in Figure 3. 
The partition is determined by the answer to a phonetic question such as “is the preceding phoneme a vowel?” or “is the following phoneme a nasal?” For each node we choose the question that maximizes the gain of F_m when the node is split into two nodes, and if all the questions decrease F_m after splitting, we stop splitting.\n\nFigure 2: A set of all triphone HMM states */a(i)/* is clustered based on the phonetic decision tree method.\n\nFigure 3: Splitting a set of triphone HMM states Ω(n) into two sets Ω(n_Y) and Ω(n_N) by answering phonetic questions according to an objective function.\n\nWe continue this splitting successively for every new set of states to obtain a binary tree, each leaf node of which holds a clustered set of triphone states. The states belonging to the same cluster are merged into a single state. A set of triphones is thus represented by a set of shared-state triphone HMMs (SST-HMMs). An HMM, which represents a phonemic unit, usually consists of a linear sequence of three or four states. A decision tree is produced specifically for each state in the sequence, and the trees are independent of each other.\n\nNote that in the triphone-state clustering mentioned above, we assume the following conditions to reduce computation:\n\n- The state assignments while splitting are fixed.\n- A single Gaussian distribution is used for each state.\n- Contributions of the transition probabilities to the objective function are ignored.\n\nUnder these conditions, the latent variables are removed. 
As a result, all variational posteriors and F_m can be obtained in closed form without an iterative procedure.\n\nOnce we have obtained the model structure, we retrain the posterior distributions using the VB algorithm given in section 4. In recognition, an unknown datum x_t for a frame t is classified into the optimal phoneme class y using the predictive posterior classification probability p(y|x_t, O, ~m) ≡ p(y) p(x_t|y, O, ~m) / p(x_t) for the estimated model structure ~m. Here, p(y) is the class prior obtained from language and lexicon models, and p(x_t|y, O, ~m) is the predictive density. If we approximate the true posterior p(Θ|y, O, ~m) by the estimated variational posterior ~q(Θ|y, O, ~m), then p(x_t|y, O, ~m) can be calculated as p(x_t|y, O, ~m) ≈ ∫ p(x_t|y, Θ, ~m) ~q(Θ|y, O, ~m) dΘ. Therefore, the optimal class y can be obtained by\n\ny = argmax_{y'} p(y'|x_t, O, ~m) ≈ argmax_{y'} p(y') ∫ p(x_t|y', Θ, ~m) ~q(Θ|y', O, ~m) dΘ.   (8)\n\nIn the calculation of (8), the integral over the Gaussian means and covariances for a frame can be solved analytically, yielding Student-t distributions. Therefore, we can compute a Bayesian predictive score for a frame, and then compute a phoneme sequence score by using the Viterbi algorithm. 
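Per dimension, integrating the Gaussian likelihood over the Normal-Gamma posterior of Eq. (4) gives a Student-t density with ~η degrees of freedom; the closed form below is the standard conjugate result under the shape-~η/2, rate-~R_d/2 reading of the gamma factor (the convention that appears in Eq. (7)). The parameter values in the sketch are hypothetical, not taken from the paper:

```python
import math

def log_student_t(x, loc, scale2, dof):
    """Log density of a Student-t distribution with `dof` degrees of
    freedom, location `loc` and squared scale `scale2`."""
    return (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
            - 0.5 * math.log(dof * math.pi * scale2)
            - (dof + 1) / 2 * math.log1p((x - loc) ** 2 / (dof * scale2)))

def log_predictive(x, nu, xi, eta, R):
    """Per-frame Bayesian predictive log-score for one diagonal-covariance
    Gaussian component; nu, xi, eta, R play the role of the posterior
    parameters ~nu_jk, ~xi_jk, ~eta_jk, ~R_jk of Eq. (5)."""
    score = 0.0
    for d in range(len(nu)):
        scale2 = R[d] * (xi + 1) / (eta * xi)  # predictive scale per dim
        score += log_student_t(x[d], nu[d], scale2, eta)
    return score
```

As ~η grows the t-density tends to a Gaussian, while small ~η and ~ξ (little data) widen the predictive scale, tempering over-confident frame scores.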
Thus, we can construct a VBEC framework for speech recognition by selecting an appropriate model structure and estimating posterior distributions with the VB approach, and then obtaining a recognition result based on Bayesian prediction classification.\n\nTable 1: Acoustic conditions\nSampling rate: 16 kHz\nQuantization: 16 bit\nFeature vector: 12-order MFCC with ΔMFCC\nWindow: Hamming\nFrame size/shift: 25/10 ms\n\nTable 2: Prepared HMM\n# of states: 3 (left to right)\n# of phoneme categories: 27\nOutput distribution: single Gaussian\n\n6 Experiments\n\nWe conducted two experiments to evaluate the effectiveness of VBEC. The first experiment compared VBEC with the conventional ML-BIC/MDL method for variable amounts of training data. In ML-BIC/MDL, retraining and recognition are based on the ML approach, and model structure selection is based on the BIC/MDL. The second experiment examined the robustness of the recognition performance with preset hyperparameter values against changes in the amount of training data.\n\n6.1 VBEC versus ML-BIC/MDL\n\nThe experimental conditions are summarized in Tables 1 and 2. As regards the hyperparameters, the mean and variance values of the Gaussian distribution in each root node were used as ν^0 and R^0, respectively, so heuristic choices of ν^0 and R^0 were avoided. The determination of ξ^0 and η^0 was still heuristic; we set ξ^0 = η^0 = 0.01, each of which was determined experimentally. The training and recognition data used in these experiments are shown in Table 3.\n\nThe total training data consisted of about 3,000 Japanese sentences spoken by 30 males. These sentences were designed so that the phonemic balance was maintained. The total recognition data consisted of 2,500 Japanese city names spoken by 25 males. 
Several subsets were randomly extracted from the training data set, and each subset was used to construct a set of SST-HMMs. In total, 40 sets of SST-HMMs were prepared from the various subsets of training data.\n\nFigures 4 and 5 show the recognition rate and the total number of states in a set of SST-HMMs, according to the varying amounts of training data. As shown in Figure 4, when the number of training sentences was less than 40, VBEC greatly outperformed ML-BIC/MDL (A). With ML-BIC/MDL (A), an appropriate model structure was obtained by maximizing, w.r.t. m, an objective function l_m^{BIC/MDL} based on the BIC/MDL, defined as:\n\nl_m^{BIC/MDL} = l(O, m) − ( #(Θ_Ω) / 2 ) log T_{Ω(0)},   (9)\n\nwhere l(O, m) denotes the likelihood of the training data O for a model structure m, #(Θ_Ω) denotes the number of free parameters for a set of states Ω, and T_{Ω(0)} denotes the total number of training data frames for the set of states Ω(0) in a root node, as shown in Figure 2. The term ( #(Θ_Ω) / 2 ) log T_{Ω(0)} in Eq. (9) is regarded as a penalty added to the likelihood, and depends on the number of free parameters #(Θ_Ω) and the total frame number T_{Ω(0)} of the training data. ML-BIC/MDL (A) is based on the original definitions of the BIC/MDL and has been widely used in speech recognition [2] [5]. 
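Eq. (9) is cheap to evaluate once a parameter count is fixed. In the sketch below, #(Θ_Ω) is taken as 2D per single-Gaussian state (a D-dimensional mean plus a diagonal variance); that count is an illustrative assumption, not a detail specified in the paper:

```python
import math

def bic_mdl_score(log_lik, n_states, dim, total_frames):
    """l_m^{BIC/MDL} = l(O, m) - #(Theta_Omega)/2 * log T_Omega(0), Eq. (9).
    Assumes 2*dim free parameters per single-Gaussian state (illustrative)."""
    n_params = n_states * 2 * dim
    return log_lik - 0.5 * n_params * math.log(total_frames)

# a split adds one state, so under this count it must buy at least
# dim * log(total_frames) nats of likelihood to raise the score:
def split_accepted(gain_in_log_lik, dim, total_frames):
    return gain_in_log_lik > dim * math.log(total_frames)
```

This is how the penalty terminates tree growth: with T_{Ω(0)} frames, each additional state must improve the likelihood by more than its share of the penalty.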
With such small amounts of training data, there was a great difference between the total number of shared states with VBEC and with ML-BIC/MDL (A) (Figure 5).\n\nTable 3: Training and recognition data\nTraining: Continuous speech sentences (Acoustical Society of Japan)\nRecognition: 100 city names (Japan Electronic Industry Development Association)\n\nFigure 4: Recognition rates according to the amounts of training data based on VBEC and ML-BIC/MDL (A) and (B). The horizontal axis is scaled logarithmically.\n\nFigure 5: Number of shared states according to the amounts of training data based on VBEC and ML-BIC/MDL (A) and (B). The horizontal and vertical axes are scaled logarithmically.\n\nThis suggests that VBEC, which does not use an asymptotic assumption, determines the model structure more appropriately than ML-BIC/MDL (A) when the training data size is small.\n\nNext, we adjusted the penalty term of the ML-BIC/MDL in Eq. (9) so that the total numbers of states for small amounts of data were as close as possible to those of VBEC (ML-BIC/MDL (B) in Figure 5). Nevertheless, the recognition rates obtained by VBEC were about 15% better than those of ML-BIC/MDL (B) with fewer than 15 training sentences (Figure 4). With such very small amounts of data, the VBEC and ML-BIC/MDL (B) model structures were almost the same (Figure 5). It appears that the effects of the posterior estimation and the Bayesian prediction classification (Eq. 
(8)) suppressed the over-fitting of the models to very small amounts of training data, compared with the ML estimation and recognition in ML-BIC/MDL (B).\n\nWith more than 100 training sentences, the recognition rates obtained by VBEC converged asymptotically to those obtained by the ML-BIC/MDL methods as the amount of training data became large.\n\nIn summary, VBEC performed as well as or better than the conventional methods for every amount of training data. This advantage was due to the superior properties of VBEC, e.g., the appropriate determination of the number of states and the suppression of over-fitting.\n\n6.2 Influence of hyperparameter values on the quality of SST-HMMs\n\nThroughout the construction of the model structure, the estimation of the posterior distributions, and recognition, we used a fixed combination of hyperparameter values, ξ^0 = η^0 = 0.01. In the small-scale experiments conducted in previous research [1] [3] [6] [8], the selection of such values was not a major concern. However, when the scale of the target application is large, the selection of hyperparameter values might affect the quality of the models. Namely, the best or better values might differ greatly according to the amount of training data. Moreover, estimating appropriate hyperparameters while training SST-HMMs takes so much time that it is impractical in speech recognition. Therefore, we examined how robustly the SST-HMMs produced by VBEC performed against changes in the hyperparameter values with varying amounts of training data.\n\nWe varied the values of the hyperparameters ξ^0 and η^0 from 0.0001 to 1, and examined the speech recognition rates in two typical cases: one in which the amount of data was very small (10 sentences) and one in which the amount was fairly large (150 sentences). 
Tables 4 and 5 show the recognition rates for each combination of hyperparameters.\n\nTable 4: Recognition rates for each pair of prior distribution parameters (ξ^0 down the rows, η^0 across the columns, each ranging over 10^0, 10^-1, 10^-2, 10^-3, 10^-4) when using training data of 10 sentences. Rates (%): 66.3, 1.0, 2.2, 65.9, 31.2, 66.1, 60.3, 66.2, 66.5, 66.6, 65.9, 66.2, 66.5, 66.7, 66.3, 66.5, 66.7, 66.3, 66.1, 65.5, 66.1, 66.1, 65.5, 65.5, 64.6.\n\nTable 5: Recognition rates for each pair of prior distribution parameters (ξ^0 down the rows, η^0 across the columns, each ranging over 10^0, 10^-1, 10^-2, 10^-3, 10^-4) when using training data of 150 sentences. Rates (%): 22.0, 93.5, 49.3, 94.3, 83.5, 94.4, 92.5, 93.8, 94.1, 93.2, 94.0, 93.1, 93.9, 93.2, 93.3, 92.3, 93.3, 92.3, 92.5, 92.3, 92.3, 92.5, 92.3, 92.4, 92.2.\n\nWe can see that the hyperparameter values giving acceptable performance are broadly distributed for both very small and fairly large amounts of training data. Moreover, when we examined roughly the ten best recognition rates in each table, the combinations of hyperparameter values that achieved them were similar for the two different amounts of training data. Namely, appropriate combinations of hyperparameter values can consistently provide good performance levels regardless of the amount of training data.\n\nIn summary, the hyperparameter values do not greatly influence the quality of the SST-HMMs. This suggests that it is not necessary to select the hyperparameter values very carefully.\n\n7 Conclusion\n\nIn this paper, we proposed VBEC, which constructs SST-HMMs based on the VB approach and recognizes speech based on Bayesian prediction classification. 
With VBEC, the model structure of the SST-HMMs is adaptively determined according to the amount of given training data, and therefore a robust speech recognition system can be constructed. The first experimental results, obtained using real speech recognition tasks, showed the effectiveness of VBEC. In particular, when the training data size was small, VBEC significantly outperformed conventional methods. The second experimental results suggested that it is not necessary to select the hyperparameter values very carefully. From these results, we conclude that VBEC provides a completely Bayesian framework for speech recognition which effectively handles the sparse data problem.\n\nReferences\n\n[1] H. Attias, “A Variational Bayesian Framework for Graphical Models,” NIPS 12, MIT Press, (2000).\n[2] W. Chou and W. Reichl, “Decision Tree State Tying Based on Penalized Bayesian Information Criterion,” Proc. ICASSP’99, vol. 1, pp. 345-348, (1999).\n[3] Z. Ghahramani and M. J. Beal, “Variational Inference for Bayesian Mixtures of Factor Analyzers,” NIPS 12, MIT Press, (2000).\n[4] J. J. Odell, “The Use of Context in Large Vocabulary Speech Recognition,” PhD thesis, Cambridge University, (1995).\n[5] K. Shinoda and T. Watanabe, “Acoustic Modeling Based on the MDL Principle for Speech Recognition,” Proc. EuroSpeech’97, vol. 1, pp. 99-102, (1997).\n[6] N. Ueda and Z. Ghahramani, “Bayesian Model Search for Mixture Models Based on Optimizing Variational Bounds,” Neural Networks, vol. 15, pp. 1223-1241, (2002).\n[7] S. Watanabe et al., “Constructing Shared-State Hidden Markov Models Based on a Bayesian Approach,” Proc. ICSLP’02, vol. 4, pp. 2669-2672, (2002).\n[8] S. Waterhouse et al., “Bayesian Methods for Mixture of Experts,” NIPS 8, MIT Press, (1995).", "award": [], "sourceid": 2174, "authors": [{"given_name": "Shinji", "family_name": "Watanabe", "institution": null}, {"given_name": "Yasuhiro", "family_name": "Minami", "institution": null}, {"given_name": "Atsushi", "family_name": "Nakamura", "institution": null}, {"given_name": "Naonori", "family_name": "Ueda", "institution": null}]}