{"title": "Connectionist Speaker Normalization with Generalized Resource Allocating Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 865, "page_last": 874, "abstract": null, "full_text": "Connectionist Speaker Normalization \n\nwith Generalized \n\nResource Allocating Networks \n\nCesare Furlanello \nIstituto per La Ricerca \nScientifica e Tecnologica \n\nPovo (Trento), Italy \nfurlan\u00ablirst. it \n\nDiego Giuliani \n\nIstituto per La Ricerca \nScientifica e Tecnologica \n\nPovo (Trento), Italy \ngiuliani\u00ablirst.it \n\nEdmondo Trentin \nIstituto per La Ricerca \nScientifica e Tecnologica \n\nPovo (Trento), Italy \ntrentin\u00ablirst.it \n\nAbstract \n\nThe paper presents a rapid speaker-normalization technique based \non neural network spectral mapping. The neural network is used \nas a front-end of a continuous speech recognition system (speaker(cid:173)\ndependent, HMM-based) to normalize the input acoustic data from \na new speaker. The spectral difference between speakers can be \nreduced using a limited amount of new acoustic data (40 phonet(cid:173)\nically rich sentences). Recognition error of phone units from the \nacoustic-phonetic continuous speech corpus APASCI is decreased \nwith an adaptability ratio of 25%. We used local basis networks of \nelliptical Gaussian kernels, with recursive allocation of units and \non-line optimization of parameters (GRAN model). For this ap(cid:173)\nplication, the model included a linear term. The results compare \nfavorably with multivariate linear mapping based on constrained \northonormal transformations. \n\n1 \n\nINTRODUCTION \n\nSpeaker normalization methods are designed to minimize inter-speaker variations, \none of the principal error sources in automatic speech recognition. 
Training a speech recognition system on a particular speaker (speaker-dependent or SD mode) generally gives better performance than using a speaker-independent system, which is trained to recognize speech from a generic user by averaging over individual differences. On the other hand, performance may be dramatically worse when an SD system \"tailored\" to the acoustic characteristics of a speaker (the reference speaker) is used by another one (the new or target speaker). Training an SD system for each new speaker may be unfeasible: collecting a large amount of new training data is time consuming for the speaker and unacceptable in some applications. Given a pre-trained SD speech recognition system, the goal of normalization methods is then to reduce to a few sentences the amount of training data required from a new speaker to achieve acceptable recognition performance. The inter-speaker variation of the acoustic data is reduced by estimating a feature vector transformation between the acoustic parameter space of the new speaker and that of the reference speaker (Montacie et al., 1989; Class et al., 1990; Nakamura and Shikano, 1990; Huang, 1992; Matsukoto and Inoue, 1992). This multivariate transformation, also called spectral mapping given the type of features considered in the parameterization of speech data, provides an acoustic front-end to the recognition system. Supervised speaker normalization methods require that the text of the training utterances collected from the new speaker be known, while arbitrary utterances can be used by unsupervised methods (Furui and Sondhi, 1991). Good performance has been achieved with spectral mapping techniques based on MSE optimization (Class et al., 1990; Matsukoto and Inoue, 1992). 
Alternative approaches estimated the spectral normalization mapping with Multi-Layer Perceptron neural networks (Montacie et al., 1989; Nakamura and Shikano, 1990; Huang, 1992; Watrous, 1994). \nThis paper introduces a supervised speaker normalization method based on neural network regression with a generalized local basis model of elliptical kernels (Generalized Resource Allocating Network: GRAN model). Kernels are recursively allocated by introducing the heuristic procedure of (Platt, 1991) within the generalized RBF schema proposed in (Poggio and Girosi, 1989). The model includes a linear term, and efficient on-line optimization of parameters is achieved by an automatic differentiation technique. Our results compare favorably with normalization by affine linear transformations based on an orthonormal constrained pseudoinverse. In this paper, the normalization module was integrated and tested as an acoustic front-end for speaker-dependent continuous speech recognition systems. Experiments concerned phone unit recognition with Hidden Markov Model (HMM) recognition systems. \n\nThe diagram in Figure 1 outlines the general structure of the experiment with GRAN normalization modules. The architecture is independent of the specific speech recognition system and allows comparisons between different normalization techniques. The GRAN model and a general procedure for data standardization are described in Sections 2 and 3. After a discussion of the spectral mapping problem in Section 4, the APASCI corpus used in the experiments and the characteristics of the acoustic data are described in Section 5. The recognition system and the experiment set-up are detailed in Sections 6-8. Results are presented and discussed in Section 9. 
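As a point of reference for the affine baseline mentioned above, the following is a minimal numpy sketch (not the authors' implementation, and without the orthonormal constraint) of fitting an MSE-optimal affine spectral mapping y ≈ Ax + b from DTW-aligned frame pairs; the dimensions, sample counts, and synthetic data are purely illustrative.

```python
import numpy as np

# Illustrative MSE-optimal affine spectral mapping: given aligned frame
# pairs (x_i from the new speaker, y_i from the reference speaker),
# fit y ~ A x + b by ordinary least squares. All sizes are made up.

rng = np.random.default_rng(0)
d = 8        # assumed acoustic feature dimension
n = 500      # assumed number of DTW-aligned frame pairs

# Synthetic "ground truth" mapping, used only to generate example data.
A_true = np.eye(d) + 0.1 * rng.standard_normal((d, d))
b_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))                                  # new-speaker frames
Y = X @ A_true.T + b_true + 0.01 * rng.standard_normal((n, d))   # reference frames

# Augment X with a constant column so the offset b is estimated jointly.
X1 = np.hstack([X, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # (d+1) x d solution matrix
A_hat, b_hat = W[:d].T, W[d]

# The fitted mapping normalizes new-speaker frames toward the reference space.
Y_hat = X @ A_hat.T + b_hat
mse = np.mean((Y_hat - Y) ** 2)
```

In the paper's actual baseline the transformation is additionally constrained to be orthonormal; the unconstrained least-squares fit above only illustrates the MSE-optimization idea.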
\n\n[Figure 1: System overview. Utterances of the same phrase from the reference speaker and from a new speaker are aligned by Dynamic Time Warping, producing training pairs (x_i(t), y_j(t)) for supervised neural network training; at test time, features extracted from the new speaker's speech signal are passed through the trained GRAN normalization module.] \n\n2 THE GRAN MODEL \n\nFeedforward artificial neural networks can be regarded as a convenient realization of general functional superpositions in terms of simpler kernel functions (Barron and Barron, 1988). With one hidden layer we can implement a multivariate superposition $f(z) = \sum_{j=0}^{n} \alpha_j K_j(z, w_j)$, where $K_j$ is a function depending on an input vector $z$ and a parameter vector $w_j$, a general structure which makes it possible to realize flexible models for multivariate regression. We are interested in the schema $y = H K(x) + Ax + b$, with input vector $x \in R^{d_1}$ and estimated output vector $y \in R^{d_2}$. $K = (K_j)$ is an $n$-dimensional vector of local kernels, $H$ is the $d_2 \times n$ real matrix of kernel coefficients, $b \in R^{d_2}$ is an offset term and $A$ is a $d_2 \times d_1$ linear term. Implemented kernels are Gaussian, Hardy multiquadrics, inverse Hardy multiquadrics and Epanechnikov kernels, also in the Nadaraya-Watson normalized form (H\u00e4rdle, 1990). The kernel allocation is based on a recursive procedure: if appropriate novelty conditions are satisfied for the example $(x', y')$, a new kernel $K_{n+1}$ is allocated and the new estimate $\hat{y}_{n+1}$ becomes $\hat{y}_{n+1}(x) = \hat{y}_n(x) + K_{n+1}(\|x - x'\|_W)(y' - \hat{y}_n(x))$ (H\u00e4rdle, 1990). Global properties and rates of convergence for recursive kernel regression estimates are given in (Krzyzak, 1992). 
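The schema y = H K(x) + Ax + b can be sketched as follows in numpy; this is an illustration of the forward pass only (not the authors' code), with Gaussian kernels, a weighted input metric defined by a matrix W, and arbitrary illustrative dimensions and parameter values.

```python
import numpy as np

# Sketch of the GRAN regression schema y = H K(x) + A x + b with n
# elliptical Gaussian kernels and a weighted input metric
# ||v||_W^2 = v^t W^t W v. All sizes and values are illustrative.

d1, d2, n = 8, 8, 5                    # input dim, output dim, kernel count

rng = np.random.default_rng(1)
C = rng.standard_normal((n, d1))       # kernel centers (locations)
s = np.full(n, 2.0)                    # kernel bandwidths
H = rng.standard_normal((d2, n))       # kernel coefficient matrix (d2 x n)
A = rng.standard_normal((d2, d1))      # linear term (d2 x d1)
b = rng.standard_normal(d2)            # offset term
W = np.eye(d1)                         # weighted-metric matrix

def gran_forward(x):
    diff = (C - x) @ W.T               # W (c_j - x) for every center c_j
    sq = np.sum(diff ** 2, axis=1)     # ||x - c_j||_W^2 for each kernel
    K = np.exp(-sq / (2.0 * s ** 2))   # Gaussian kernel activations
    return H @ K + A @ x + b           # y = H K(x) + A x + b

y = gran_forward(rng.standard_normal(d1))
```

Far from every kernel center the activations vanish and the model reduces to the affine term Ax + b, which is the role of the linear term noted in the abstract.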
The heuristic mechanism suggested by (Platt, 1991) has been extended to include the optimization of the weighted metrics as requested in the generalized versions of RBF networks of (Poggio and Girosi, 1989). Optimization regards kernel coefficients, locations and bandwidths, the offset term, the coefficient matrix $A$ if considered, and the $W$ matrix defining the weighted metric in the input space: $\|x\|_W^2 = x^t W^t W x$. Automatic differentiation is used for an efficient on-line gradient-descent procedure w.r.t. different error functions (L2, L1, entropy fit), with different learning rates for each type of parameter. 
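The allocation heuristic can be sketched as follows; this is a simplified illustration in the style of Platt's (1991) resource-allocating network, not the paper's exact procedure. The novelty thresholds `eps` and `delta`, the overlap factor `kappa`, and the function name are all assumptions for the example, and the gradient-descent update applied when no kernel is allocated is omitted.

```python
import numpy as np

# Simplified RAN-style allocation: a new kernel is allocated for the
# example (x', y') only when it is both far from all existing centers
# and poorly predicted (the "novelty conditions"); otherwise the
# existing parameters would be tuned by gradient descent (omitted).
# Thresholds below are illustrative, not values from the paper.

eps, delta = 1.0, 0.1   # distance and prediction-error thresholds

def maybe_allocate(centers, widths, coeffs, x_new, y_new, y_pred, kappa=0.5):
    # Distance to the nearest existing center (infinite if none yet).
    dist = min((np.linalg.norm(x_new - c) for c in centers), default=np.inf)
    err = np.linalg.norm(y_new - y_pred)
    if dist > eps and err > delta:                    # novelty conditions met
        centers.append(x_new.copy())                  # new kernel location
        widths.append(kappa * dist if np.isfinite(dist) else 1.0)
        coeffs.append(y_new - y_pred)                 # absorb the residual
        return True
    return False
```

Allocating the residual y' - ŷ(x') as the new kernel's coefficient is what makes the recursive estimate ŷ_{n+1}(x) = ŷ_n(x) + K_{n+1}(||x - x'||_W)(y' - ŷ_n(x)) exact at x' for a kernel with K_{n+1}(0) = 1.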