{"title": "Modeling Time Varying Systems Using Hidden Control Neural Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 147, "page_last": 154, "abstract": null, "full_text": "Modeling Time Varying Systems \n\nUsing Hidden Control Neural Architecture \n\nEsther Levin \n\nAT&T Bell Laboratories \n\nSpeech Research Department \nMurray Hill, NJ 07974 USA \n\nABSTRACT \n\nMulti-layered neural networks have recently been proposed for non(cid:173)\nlinear prediction and system modeling. Although proven successful \nfor modeling time invariant nonlinear systems, the inability of neural \nnetworks to characterize temporal variability has so far been an \nobstacle in applying them to complicated non stationary signals, such \nas speech. In this paper we present a network architecture, called \n\"Hidden Control Neural Network\" (HCNN), for modeling signals \ngenerated by nonlinear dynamical systems with restricted time \nvariability. The approach taken here is to allow the mapping that is \nimplemented by a multi layered neural network to change with time \nas a function of an additional control input signal. This network is \ntrained using an algorithm that is based on \"back-propagation\" and \nsegmentation algorithms for estimating the unknown control together \nwith the network's parameters. The HCNN approach was applied to \nseveral tasks including modeling of time-varying nonlinear systems \nand speaker-independent recognition of connected digits, yielding a \nword accuracy of 99.1 %. \n\nL INTRODUCTION \nLayered networks have attracted considerable interest in recent years due to their \nability to model adaptively nonlinear multivariate functions. It has been recently proved \nin [1]. that a network with one intennediate layer of sigmoidal units can approximate \narbitrarily well any continuous mapping. 
However, being a static model, a layered network is not capable of modeling signals with an inherent time variability, such as speech.

In this paper we present a hidden control neural network that can implement a nonlinear and time-varying mapping. The hidden control input signal, which allows the network's mapping to change over time, provides the ability to capture the nonstationary properties and learn the underlying temporal structure of the modeled signal.

II. THE MODEL

II.1 MULTI-LAYERED NETWORK

A multi-layered neural network is a connectionist model that implements a nonlinear mapping from an input x ∈ X ⊂ R^{N_I} to an output y ∈ Y ⊂ R^{N_O}:

y = F_ω(x),    (1)

where ω ∈ Ω ⊂ R^D, the parameter set of the network, consists of the connection weights and the biases, and x and y are the activation vectors of the input and output layers, of dimensionality N_I and N_O, respectively.

Recently, layered networks have proven useful for nonlinear prediction of signals and system modeling [2]. In these applications one uses the values of a real signal x(t) at a set of discrete times in the past to predict x(t) at a point in the future. For example, for an order-one predictor, the output of the network y is used as a predictor of the next signal sample when the network is given the past sample as input, i.e., y = x̂_t = F_ω(x_{t-1}), where x̂_t denotes the predicted value of the signal at time t, which, in general, differs from the true value x_t. The parameter set ω is estimated from a training set of discrete-time samples of a known signal segment { x_t, t = 0, ..., T }, by minimizing a prediction error which measures the distortion between the signal and the prediction made by the network,

E(ω) = Σ_{t=1}^{T} || x_t - F_ω(x_{t-1}) ||²,    (2)

and the estimated parameter set is given by ω̂ = argmin_ω E(ω).
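As a concrete illustration of this training criterion (a minimal sketch, not the implementation used in [2]; the network size, learning rate, and iteration count are arbitrary choices), a one-hidden-layer sigmoidal predictor can be fitted by batch gradient descent to one-step-ahead samples of the chaotic logistic series discussed next, x_{t+1} = 4 x_t (1 - x_t):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate a chaotic series from the logistic map x_{t+1} = 4 x_t (1 - x_t).
T = 400
x = np.empty(T + 1)
x[0] = 0.3
for t in range(T):
    x[t + 1] = 4.0 * x[t] * (1.0 - x[t])
inp, tgt = x[:-1, None], x[1:, None]        # training pairs (x_{t-1}, x_t)

# One sigmoidal hidden layer, linear output: y = W2 @ sigmoid(W1 x + b1) + b2.
H = 16
W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(u):
    h = sigmoid(u @ W1 + b1)
    return h, h @ W2 + b2

def pred_error():
    _, y = forward(inp)
    return np.sum((tgt - y) ** 2)           # prediction error E(w), Eq. (2)

e0 = pred_error()
lr = 0.05
for _ in range(2000):                       # plain batch gradient descent
    h, y = forward(inp)
    d_y = 2.0 * (y - tgt) / T               # dE/dy, averaged over the batch
    dW2 = h.T @ d_y; db2 = d_y.sum(0)
    d_h = d_y @ W2.T * h * (1.0 - h)        # back-propagate through the sigmoid
    dW1 = inp.T @ d_h; db1 = d_h.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

e1 = pred_error()
```

After training, the prediction error on the series is far below its initial value, mirroring how the network in [2] came to approximate the generating map.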
In [2] such a neural network predictor is used for modeling chaotic series. One of the examples considered in [2] is prediction of a time series generated by the classic logistic, or Feigenbaum, map,

x_{t+1} = 4 b x_t (1 - x_t).    (3)

This iterated map produces an ergodic chaotic time series when b is chosen to equal 1. Although this time series passes virtually every test for randomness, it is generated by the deterministic Eq. (3), and can be predicted perfectly once the generating system (3) is learned. Using the back-propagation algorithm [3] to minimize the prediction error (2) defined on a set of samples of this time series, the network parameters ω were adjusted, enabling accurate prediction of the next point x_{t+1} in this "random" series given the present point x_t as input. The mapping F_ω implemented by the trained network approximated very closely the logistic map (3) that generated the modeled series.

II.2 HIDDEN CONTROL NETWORK

For a given fixed value of the parameters ω, a layered network implements a fixed input-output mapping, and therefore can be used for time-invariant system modeling or prediction of signals generated by a fixed, time-invariant system. A hidden control network that is based on such a layered network has an additional mechanism that allows the mapping (1) to change with time while keeping the parameters ω fixed. We consider the case where the units in the input layer are divided into two distinct groups. The first input unit group represents the observable input to the network, x ∈ X ⊂ R^p, and the second represents a control signal c ∈ C ⊂ R^q, p + q = N_I, that controls the mapping between the observable input x and the network output y. The output of the network y is given, according to (1), by F_ω(x, c), where (x, c) denotes the concatenation of the two inputs. We focus on the mapping between the observable input x and the output.
This mapping is modulated by the control input c: for a fixed value of x and for different values of c, the network produces different outputs. For a fixed control input, the network implements a fixed observable input-output mapping, but when the control input changes, the network's mapping changes as well, modifying the characteristics of the observed signal:

y = F_ω(x, c) = F_{ω,c}(x).    (4)

If the control signal is known for all time t, there is no point in distinguishing between the observable input x and the control input c. The more interesting situation is when the control signal is unknown, or hidden, i.e., the hidden control case, which we treat in this paper.

This model can be used for prediction and modeling of nonstationary signals generated by time-varying sources. In the case of first-order prediction, the present value of the signal x_t is predicted based on x_{t-1}, with respect to the control input c_t. If we restrict the control signal to take its values from a finite set, c ∈ { c_1, ..., c_N } ≡ C, then the network is a finite-state network, where in each state it implements a fixed input-output mapping F_{ω,c_i}. Such a network with two or more intermediate layers can approximate arbitrarily closely any set { F_1, ..., F_N } of continuous functions of the observable input x [4].

In the applications we considered for this model, two types of time structures were used, namely:

Fully connected model: In this type of HCNN, every state, corresponding to a specific value of the control input, can be reached from any other state in a single time step. This means that there are no temporal restrictions on the control signal, and in each time step it can take any of its N possible values { c_1, ..., c_N }. For example, a 2-state fully connected model is shown in Fig. 1a.
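In code, the state selection amounts to nothing more than concatenating a one-hot encoding of the control value to the observable input. The sketch below uses arbitrary, untrained random weights; the dimensions p, N, and the hidden size are hypothetical choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

p, N, H = 1, 2, 8          # observable dim, number of control values, hidden units
W1 = rng.normal(0, 1, (p + N, H)); b1 = rng.normal(0, 1, H)
W2 = rng.normal(0, 1, (H, p));     b2 = np.zeros(p)

def F(x, c):
    """F_w(x, c): fixed parameters w; the input is the concatenation (x, c),
    with the control value c encoded as a one-hot vector."""
    u = np.concatenate([np.atleast_1d(x), np.eye(N)[c]])
    h = 1.0 / (1.0 + np.exp(-(u @ W1 + b1)))   # sigmoidal hidden layer
    return h @ W2 + b2

# Same observable input, different control values: the network realizes a
# different mapping F_{w,c} in each state, while w itself is unchanged.
y0, y1 = F(0.5, 0), F(0.5, 1)
```

Switching c thus switches which of the N mappings is applied, which is exactly the finite-state behavior described above.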
In a generative mode of operation, when the observable input of the network is wired to be the previous network output, the observable signal x(t) is generated in each of the states by a different dynamics: x_{t+1} = F_{c_t}(x_t), c_t ∈ {0, 1}, and therefore this network emulates two different dynamical systems, with the control signal acting as a switch between them.

Left-to-right model: For spoken word modeling, we will consider a finite-state, left-to-right HCNN (see Fig. 1b), where the control signal is further restricted to take value c_i only if in the previous time step it had the value c_i or c_{i-1}. Each state of this network represents an unspecified acoustic unit, and due to the "left-to-right" structure, the whole word is modeled as a concatenation of such acoustic units. The time spent in each of the states is not fixed, since it varies according to the value of the control signal, and therefore the model can take into account the duration variability between different utterances of the same word.

Figure 1: a - Fully connected 2-state HCNN; b - Left-to-right 8-state HCNN for word modeling.

III. USING HCNN

Given the predictive form of the HCNN described in the previous section, there are three basic problems of interest that must be solved for the model to be useful in real-world applications. These problems are the following:

Segmentation problem: Here we attempt to uncover the hidden part of the model, i.e., given a network ω and a sequence of observations { x_t, t = 0, ..., T }, to find the control sequence which best explains the observations. This problem is solved using an optimality criterion, namely the prediction error, similar to Eq. (2):

E(ω, c_1^T) = Σ_{t=1}^{T} || x_t - F_{ω,c_t}(x_{t-1}) ||²,    (5)

where c_1^T denotes the control sequence c_1, ..., c_T, c_i ∈ C. For a given network
ω, the prediction error (5) is a function of the hidden control input sequence, and thus segmentation is associated with the minimization:

ĉ_1^T = argmin_{c_1^T} E(ω, c_1^T).    (6)

In the case of a finite-state, fully connected model, this minimization can be performed exhaustively, by minimizing for each observation separately; for a fully connected HCNN with a real-valued control signal (i.e., not the finite-state case), local minimization of (5) can be performed using the back-propagation algorithm. For a "left-to-right" model, the global minimum of (5) is attained efficiently using the Viterbi algorithm [5].

Evaluation problem, namely how well a given network ω matches a given sequence of observations { x_t, t = 0, ..., T }. The evaluation is a key point for many applications. For example, if we are trying to choose among several competing networks that represent different hypotheses in the hypothesis space, the solution to this problem allows us to choose the network that best matches the observations. This problem is also solved using the prediction error defined in (5). The match, or actually the distortion, is measured by the prediction error of the network on a sequence of observations, for the best possible sequence of hidden control inputs, i.e.,

E(ω) = min_{c_1^T} E(ω, c_1^T).    (7)

Therefore, to evaluate a network, the segmentation problem must be solved first.

Training problem, i.e., how to adjust the model parameters ω to best match the observation sequence, or training set, { x_t, t = 0, ..., T }. Training in layered networks is accomplished by minimizing the prediction error of Eq. (2) using versions of the back-propagation algorithm. In the HCNN case, the prediction error (5) is a function of both the network parameters and the hidden control input sequence, and thus training is associated with the joint minimization:

ω̂ = argmin_ω { min_{c_1^T} E(ω, c_1^T) }.    (8)
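For the left-to-right case, the segmentation step can be sketched as a standard Viterbi recursion over per-state prediction errors. The sketch assumes the local errors || x_t - F_{ω,c_i}(x_{t-1}) ||² have already been computed for every frame and state, and adopts a start-in-first-state, end-in-last-state convention (an assumption of this sketch, not spelled out in the text):

```python
import numpy as np

def viterbi_segment(d):
    """Minimize the prediction error of Eq. (5) over left-to-right
    control sequences.

    d[t, i] = squared prediction error of state i at time t,
              i.e. ||x_t - F_{w,c_i}(x_{t-1})||^2.
    Allowed moves: stay in state i or advance to state i+1; the path
    starts in state 0 and ends in the last state.
    Returns (total error, optimal state sequence)."""
    T, N = d.shape
    cost = np.full((T, N), np.inf)
    back = np.zeros((T, N), dtype=int)
    cost[0, 0] = d[0, 0]
    for t in range(1, T):
        for i in range(N):
            prev = cost[t - 1, i]                 # stay in the same state
            back[t, i] = i
            if i > 0 and cost[t - 1, i - 1] < prev:
                prev = cost[t - 1, i - 1]         # advance from state i-1
                back[t, i] = i - 1
            cost[t, i] = prev + d[t, i]
    states = [N - 1]                              # backtrack from final state
    for t in range(T - 1, 0, -1):
        states.append(back[t, states[-1]])
    return cost[-1, -1], states[::-1]
```

For instance, with two states whose errors flip halfway through an utterance, the recursion recovers the segmentation [0, 0, 1, 1] with the minimal total error.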
\n\nQ \n\ncT \n\n(8) \n\nThis minimization is perfonned by an iterative training algorithm. \nThe k-th iteration of the algorithm consists of two stages: \n1. Reestimation: For the present value of the control input sequence. the prediction \nerror is minimized with respect to the network parameters. \n(ro)k==argminE(ro. (C[h-l) \n\n(9) \n\na \n\n\fModeling Time Varying Systems Using Hidden Control Neural Architecture \n\n151 \n\nThis minimization is implemented by the back-propagation algorithm. \n2. Segmentation: Using the values of parameters. obtained from the previous stage. \nthe control sequence is estimated (as in (6) ). \n\n(c[}k=argminE\u00abro)A: .c[) \n\nc T \n\n(10) \n\nIV. HCNN AS A STATISTICAL MODEL \nFor further understanding of the properties of the proposed model and the training \nprocedure. it is useful to describe the HCNN by an equivalent statistical vector source \nof the following form: \n\nx, = Fm,c,(X,-l )+n,. n,-N(O.J). \n\n(11) \nwhere n, is a white Gaussian noise. Assuming for simplicity that all the values of the \ncontrol allowed by the model are equiprobable (this is a special case of Markov \nprocess. and can be easily extended for the general case) \u2022 we can write the joint \nlikelihood of the data and the control \npT \n\nT \n\np(xLc[ I ro)=(27t)-T exp[-~ L II x,-F CIl,c,(Xr-1) 11 2]. \n\n(12) \n\n,=1 \n\nwhere xf denotes the sequence of observation {x 1. X2. . \u2022. .XT}' \nEq.(12) provides a probabilistic interpretation of the procedures described in the \nprevious section: \nThe proposed segmentation procedure is equivalent to choosing the most probable \ncontrol sequence. given the network and the observations. \nThe evaluation of the network is related to the probability of the observations given the \nmodel. for the best sequence of control inputs. \n\nmin E (ro. cD <=> max P (x[. cf I ro) \u2022 \nc T \n\nc T \n\n(13) \n\nThe proposed training procedure (Eq. 
8) is equivalent to maximization of the joint likelihood (12):

ω̂ = argmin_ω { min_{c_1^T} E(ω, c_1^T) } = argmax_ω { max_{c_1^T} P(x_1^T, c_1^T | ω) }.    (14)

Thus (8) is equivalent to an approximate maximum likelihood training, where instead of maximizing the marginal likelihood P(x_1^T | ω) = Σ_{c_1^T} P(x_1^T, c_1^T | ω), only the maximal term in the sum, the joint likelihood (14), is considered. The approximate maximum likelihood training avoids the computational complexity of the exact maximum likelihood approach, and was recently shown [6] to yield results similar to those obtained by exact maximum likelihood training.

IV.1 HCNN and the Hidden Markov Model (HMM)

During the past decade, hidden Markov modeling has been used extensively to represent the probability distribution of spoken words [7]. A hidden Markov model assumes that the modeled speech signal can be characterized as being produced at each time instant by one of the states of a finite-state source, and that each observation vector is an independent sample according to the probability distribution of the current state. The transitions between the states of the model are governed by a Markov process.

The HCNN can be viewed as an extension of this model to the case of Markov output processes. The observable signal in each state is modeled as though it was produced by a dynamical system driven by noise. Here we are modeling the dynamics F_ω that generated the signal, and the dependence of the present observation vector on the previous one. The assumption that the driving noise (12) is normal is not necessary; instead, we can assume a parametric form of the noise density and estimate its parameters.

V. EXPERIMENTAL EVALUATION

For experimental evaluation of the proposed model,
we tested it on two different tasks.

V.1 Time-varying system modeling and segmentation

Here an HCNN was used for single-step prediction of a signal generated by a time-varying system, described by

x_{t+1} = F_L(x_t)      if switch = 0,
x_{t+1} = 1 - F_L(x_t)  if switch = 1,    (15)

where F_L is the logistic map from Eq. (3), and switch is a random variable assuming binary values. Both of the systems, F_L and 1 - F_L, are chaotic and produce signals in the range [0, 1]. A fully connected, 2-state HCNN (each state corresponding to one switch position), as in Fig. 1a, was trained on a segment of 400 samples of such a signal, according to the training algorithm described in Section III. The performance of the resulting network was tested on an independent set of 1000 samples of this signal. The estimated control sequence differed from the real switch position in only 8 out of 1000 test samples. The evaluation score, i.e., the average prediction error for this estimated control sequence, was 7.5×10^-5 per sample. Fig. 2 compares the mapping implemented by the network in one state, corresponding to the control value set to 0, and the logistic map for switch = 0. Similar results are obtained for c = 1 and switch = 1. These results indicate that the HCNN was indeed able to capture the two underlying dynamics that generated the modeled signal, and to learn the switching pattern simultaneously.

Fig. 2: Comparison of the logistic map and the mapping implemented by the HCNN with c = 0.

V.2 Continuous recognition of digit sequences

Here we tested the proposed HCNN modeling technique on recognition of connected spoken versions of the digits "zero" to "nine", including the word "oh", recorded from male speakers through a telephone handset and sampled at 6.67
LPC analysis of order 8 was performed on frames of 45 msec duration, with \noverlap of 15 msec, and 12 cepstral and 12 delta cepstral [8] coefficients were derived \nfor the t-th frame to form the observable signal X\" Each digit was modeled by an 8 \nstate,left-to-right RCNN, as in Fig.1b. The network was trained to predict the cepstral \nand delta cepstral coefficients for the next frame. Each network consisted of 32 input \nunits (24 to encode Xt and 8 for a distributed representation of the 8 control values), 24 \noutput units and 30 hidden units, all fully connected. Each network was trained using \na training set of 900 utterances from 44 male speakers extracted from continuous \nstrings of digits using an HMM based recognizer [9]. 1666 strings (5600 words), \nuttered by an independent set of 22 male speakers were used for estimating the \nrecognition accuracy. The mean and the covariance of the driving noise (12) were \nmodeled. The word accuracy obtained was 99.1 %. \nFig. 3a illustrates the process of recognition (the forward pass of Viterbi algorithm) of \nthe word \"one\" by the speaker-independent system. The horizontal axis is time (in \nframes). 11 models from \"zero\" to \"nine\" , and \"oh\" appear on the vertical axis. The \nnumbers that appear in the graph (from 1 to 8) describe the number of a state. For \nexample, number 2 inside the second row of the graph denotes state number 2 of the \nmodel of the word \"one\". In each frame, the prediction error was calculated for each \none of the states in each model, resulting in 88 different prediction errors. The graph \nin each frame shows the states of the models that are in the vicinity of the minimal \nerror among those 88. 
This is a partial description of the forward pass of the Viterbi algorithm in recognition, before the left-to-right constraints of the models are taken into account. Figure 3a shows that the main candidate considered in recognition of the word "one" is the actual model of "one", but at the end of the word two spurious candidates arise. The spurious candidates are certain states of the models of "seven" and "nine"; those states are detectors of the nasal 'n' that appears in all three words. Figure 3b shows the recognition of the four-digit string "three - five - oh - four". The spurious candidates indicate detectors of certain sounds common to different words, as in "four" and "oh", in "five" and "nine", and in "three", "six", and "eight".

Fig. 3: Illustration of the recognition process.

VI. SUMMARY AND DISCUSSION

This paper introduces a generalization of the layered neural network that can implement a time-varying nonlinear mapping between its observable input and output. The variation of the network's mapping is due to an additional, hidden control input, while the network parameters remain unchanged. We proposed an algorithm for finding the network parameters and the hidden control sequence from a training set of examples of observable input and output. This algorithm implements an approximate maximum likelihood estimation of the parameters of an equivalent statistical model, in which only the dominant control sequence is taken into account.
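The alternation between reestimation and segmentation can be sketched as follows. This is a toy illustration only: to keep it short, the per-state network F_{ω,c} is replaced by a hypothetical constant predictor μ_c, so the reestimation step (Eq. 9) is closed-form instead of back-propagation, and the segmentation step (Eq. 10) is the per-sample argmin of the fully connected case; the data and dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: two interleaved regimes, each producing a different level.
T = 200
true_c = rng.integers(0, 2, T)
x = np.where(true_c == 0, 0.2, 0.8) + rng.normal(0, 0.02, T)

# Degenerate per-state predictor F_{w,c}(x_{t-1}) = mu_c, a stand-in for
# the per-state network (so reestimation is a mean instead of backprop).
mu = np.array([0.4, 0.6])
c = rng.integers(0, 2, T)                  # initial hidden control guess

def E(mu, c):
    return np.sum((x - mu[c]) ** 2)        # prediction error, Eq. (5)

errs = [E(mu, c)]
for _ in range(10):
    # 1. Reestimation (Eq. 9): fit parameters for the current segmentation.
    for i in (0, 1):
        if np.any(c == i):
            mu[i] = x[c == i].mean()
    # 2. Segmentation (Eq. 10): fully connected model -> per-sample argmin.
    c = np.argmin((x[:, None] - mu[None, :]) ** 2, axis=1)
    errs.append(E(mu, c))
```

Because each stage minimizes the same criterion E over one block of variables while the other is held fixed, the error sequence is non-increasing, which is the property that makes the coordinate-descent training of Section III well-behaved.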
The conceptual difference between the proposed model and the HMM is that in the HMM approach, the observable data in each of the states is modeled as though it was produced by a memoryless source, and a parametric description of this source is obtained during training, while in the proposed model the observations in each state are produced by a nonlinear dynamical system driven by noise, and both the parametric form of the dynamics and the noise are estimated. The performance of the model was illustrated on the tasks of nonlinear time-varying system modeling and continuously spoken digit recognition. The reported results show the potential of this model for providing high-performance speech recognition capability.

Acknowledgment

Special thanks are due to N. Merhav for numerous comments and helpful discussions. Useful discussions with N.Z. Tishby, S.A. Solla, L.R. Rabiner, and J.G. Wilpon are greatly appreciated.

References

1. G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control Signals Systems, in press, 1989.
2. A. Lapedes and R. Farber, "Nonlinear signal processing using neural networks: prediction and system modeling," Proc. of the IEEE, in press, 1989.
3. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, 1986.
4. E. Levin, "Word recognition using hidden control neural architecture," Proc. of ICASSP, Albuquerque, April 1990.
5. G.D. Forney, "The Viterbi algorithm," Proc. IEEE, vol. 61, pp. 268-278, Mar. 1973.
6. N. Merhav and Y. Ephraim, "Maximum likelihood hidden Markov modeling using a dominant sequence of states," accepted for publication in IEEE Transactions on ASSP.
7. L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, pp.
257-286, February 1989.
8. B.S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Am., vol. 55, no. 6, pp. 1304-1312, June 1974.
9. L.R. Rabiner, J.G. Wilpon, and F.K. Soong, "High performance connected digit recognition using hidden Markov models," IEEE Transactions on ASSP, vol. 37, 1989.", "award": [], "sourceid": 363, "authors": [{"given_name": "Esther", "family_name": "Levin", "institution": null}]}