{"title": "A Micropower Analog VLSI HMM State Decoder for Wordspotting", "book": "Advances in Neural Information Processing Systems", "page_first": 727, "page_last": 733, "abstract": null, "full_text": "A Micropower Analog VLSI HMM State Decoder for Wordspotting \n\nJohn Lazzaro and John Wawrzynek \nCS Division, UC Berkeley \nBerkeley, CA 94720-1776 \nlazzaro@cs.berkeley.edu, johnw@cs.berkeley.edu \n\nRichard Lippmann \nMIT Lincoln Laboratory \nRoom S4-121, 244 Wood Street \nLexington, MA 02173-0073 \nrpl@sst.ll.mit.edu \n\nAbstract \n\nWe describe the implementation of a hidden Markov model state decoding system, a component for a wordspotting speech recognition system. The key specification for this state decoder design is microwatt power dissipation; this requirement led to a continuous-time, analog circuit implementation. We characterize the operation of a 10-word (81 state) state decoder test chip. \n\n1. INTRODUCTION \n\nIn this paper, we describe an analog implementation of a common signal processing block in pattern recognition systems: a hidden Markov model (HMM) state decoder. The design is intended for applications such as voice interfaces for portable devices that require micropower operation. In this section, we review HMM state decoding in speech recognition systems. \n\nAn HMM speech recognition system consists of a probabilistic state machine, and a method for tracing the state transitions of the machine for an input speech waveform. Figure 1 shows a state machine for a simple recognition problem: detecting the presence of keywords (\"Yes,\" \"No\") in conversational speech (non-keyword speech is captured by the \"Filler\" state). This type of recognition where keywords are detected in unconstrained speech is called wordspotting (Lippmann et al., 1994). \n\nFigure 1. 
A two-keyword (\"Yes,\" states 1-10; \"No,\" states 11-20) HMM. \n\nOur goal during speech recognition is to trace out the most likely path through this state machine that could have produced the input speech waveform. This problem can be partially solved in a local fashion, by examining short (80 ms window), overlapping (15 ms frame spacing) segments of the speech waveform. We estimate the probability b_i(n) that the signal in frame n was produced by state i, using static pattern recognition techniques. \n\nTo improve the accuracy of these local estimates, we need to integrate information over the entire word. We do this by creating a set of state variables for the machine, called likelihoods, that are incrementally updated at every frame. Each state i has a real-valued likelihood \u03c6_i(n). \n\nFigure 3. (a) Analog discrete-time single-state decoder. (b) Enhanced version of (a), which includes the renormalization system. (c) Continuous-time extension of (b). \n\nEquation 1 uses two types of variables: probabilities and log likelihoods. In the implementation shown in Figure 3, we choose unidirectional current as the signal type for probability, and large-signal voltage as the signal type for log likelihood. We can understand the dimensional scaling of these signal types by analyzing the floating-well transistor labeled (4) in Figure 3a. The equation \n\nV_m log(\u03c6_i(n)) = V_m log(b_i(n)) + g_i(n-1) + V_m log(I_h/I_0) (2) \n\ndescribes the behavior of this transistor, where V_m = (V_0/\u03ba_p) ln(10), g_i(n-1) is the output of the delay element, and I_0, \u03ba and V_0 are MOS parameters. Both I_0 and \u03ba in Equation 2 are functions of V_sb. However, the floating-well topology of the transistor ensures V_sb = 0 for this device. 
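As a rough numerical illustration of the scaling in Equation 2, the sketch below evaluates its right-hand side in volts. It assumes that log is base 10, which follows from the ln(10) factor in the definition of V_m, and it takes V_m = 85 mV, the nominal test-chip value quoted in the text. The function name and the example operating point, including the ratio I_h/I_0, are hypothetical and not taken from the paper.

```python
import math

V_M = 0.085  # volts per decade of likelihood; nominal test-chip value from the text


def eq2_log_likelihood_voltage(b, g_prev, ih_over_i0, v_m=V_M):
    '''Right-hand side of Equation 2, in volts.

    b          -- input probability b_i(n), 0 < b <= 1
    g_prev     -- delayed term g_i(n-1), in volts
    ih_over_i0 -- ratio I_h / I_0 (illustrative value, not from the paper)
    '''
    if not 0.0 < b <= 1.0:
        raise ValueError('b must lie in (0, 1]')
    return v_m * math.log10(b) + g_prev + v_m * math.log10(ih_over_i0)


# Hypothetical operating point: b = 0.1, g = 1.0 V, I_h / I_0 = 1000.
print(eq2_log_likelihood_voltage(0.1, 1.0, 1000.0))
```

In this sketch the last term is a positive offset whenever I_h exceeds I_0, and the first term is never positive because b is at most 1.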
\n\nThe input probability b_i(n) is scaled by the unidirectional current I_h, defining the current flowing through the transistor. The current I_h is the largest current that keeps the transistor in the weak-inversion regime. We define I_l to be the smallest value for I_h b_i(n) that allows the circuit to settle within the clock period. The ratio I_h/I_l sets the supported range of b_i(n). In the test-chip fabrication process, I_h/I_l \u2248 10,000 is feasible, which is sufficient for accurate wordspotting. Likewise, the unitless log(\u03c6_i(n)) is scaled by the voltage V_m to form a large-signal voltage encoding of log likelihood. A nominal value for V_m is 85 mV in the test-chip process. To support a log likelihood range of 35 (the necessary range for accurate wordspotting), a large-signal voltage range of 3 volts (i.e., 35 V_m) is required. \n\nThe term g_i(n-1) in Equation 2 is shown as the output of the circuit labeled (1) in Figure 3a. This circuit computes a function that approximates the desired expression V_m log(\u03c6_i(n-1) + \u03c6_{i-1}(n-1)), if the transistors in the circuit operate in the weak-inversion regime. \n\nThe computed log likelihood log(\u03c6_i(n)) in Equation 1 decreases every frame. The circuit shown in Figure 3a does not behave in this way: the voltage V_m log(\u03c6_i(n)) increases every frame. This difference in behavior is attributable to the constant term V_m log(I_h/I_0) in Equation 2, which is not present in Equation 1, and is always larger than the negative contribution from V_m log(b_i(n)). Figure 3b adds a new circuit (labeled (2)) to Figure 3a that allows the constant term in Equation 2 to be altered under control of the binary input V. If V is V_dd, the circuit in Figure 3b is described by \n\nV_m log(\u03c6_i(n)) = V_m log(b_i(n)) + g_i(n-1) + V_m log(I_h I_0 / I_v^2), (3a) \n\nwhere the term V_m log((I_h I_0)/I_v^2) should be less than or equal to zero. 
If V is grounded, the circuit is described by \n\nV_m log(\u03c6_i(n)) = V_m log(b_i(n)) + g_i(n-1) + V_m log(I_h/I_v), (3b) \n\nwhere the term V_m log(I_h/I_v) should have a positive value of at least several hundred millivolts. The goal of this design is to create two different operational modes for the system. One mode, described by Equation 3a, corresponds to the normal state decoder operation described in Equation 1. The other mode, described by Equation 3b, corresponds to the renormalization procedure, where a positive constant is added to all likelihoods in the system. During operation, a control system alternates between these two modes, to manage the dynamic range of the system. \n\nSection 1 formulated HMMs as discrete-time systems. However, there are significant advantages in replacing the z^{-1} element in Figure 3b with a continuous-time delay circuit. The switching noise of a sampled delay is eliminated. The power consumption and cell area specifications also benefit from continuous-time implementation. \n\nFundamentally, a change from discrete-time to continuous-time is not only an implementation change, but also an algorithmic change. Figure 3c shows a continuous-time state decoder whose observed behavior is qualitatively similar to a discrete-time decoder. The delay circuit, labeled (3), uses a linear transconductance amplifier in a follower-integrator configuration. The time constant of this delay circuit should be set to the frame period of the corresponding discrete-time state decoder. \n\nFor correct decoder behavior over the full range of input probability values, the transconductance amplifier in the delay circuit must have a wide differential-input-voltage linear range. In the test chip presented in this paper, an amplifier with a small linear range was used. 
To work around the problem, we restricted the input probability currents in our experiments to a small multiple of I_l. \n\nFigure 4 shows a state decoding system that corresponds to the grammar shown in Figure 1. Each numbered circle corresponds to the circuit shown in Figure 3c. The signal flows of this architecture support a dense layout: a rectangular array of single-state decoding circuits, with input current signals entering from the top edge of the array, and end-state log likelihood outputs exiting from the right edge of the array. States connect to their neighbors via the V_{i-1}(t) and V_i(t) signals shown in Figure 3c. For notational convenience, in this figure we define the unidirectional current P_i(t) to be I_h b_i(t). \n\nIn addition to the single-state decoder circuit, several other circuits are required. The \"Recurrent Connection\" block in Figure 4 implements the loopback connecting the filled circles in Figure 1. We implement this block using a 3-input version of the voltage follower circuit labeled (1) in Figure 3c. A simple arithmetic circuit implements the \"Word Detect\" block. To complete the system, a high fan-in/fan-out control circuit implements the renormalization algorithm. The circuit takes as input the log likelihood signals from all states in the system, and returns the binary signal V to the control input of all states. This control signal determines whether the single-state decoding circuits exhibit normal behavior (Equation 3a) or renormalization behavior (Equation 3b). \n\nFigure 4. State decoder system for the grammar shown in Figure 1. \n\n3. STATE DECODER TEST CHIP \n\nWe fabricated a state decoder test chip in the 2 \u03bcm, n-well process of Orbit Semiconductor, via MOSIS. The chip has been fully tested and is functional. 
The chip decodes a grammar consisting of eight ten-state word models and a filler state. The state decoding and word detection sections of the chip contain 2000 transistors, and measure 586 x 2807 \u03bcm (586 x 2807 \u03bb, \u03bb = 1.0 \u03bcm). In this section, we show test results from the chip, in which we apply a temporal pattern of probability currents to the ten states of one word in the model (numbered 1 through 10) and observe the log likelihood voltage of the final state of the word (state 10). \n\nFigure 5 contains simulated results, allowing us to show internal signals in the system. Figure 5a shows the temporal pattern of input probability currents P_1 ... P_10 that correspond to a simulated input word. Figure 5b shows the log likelihood voltage waveform for the end-state of the word (state 10). The waveform plateaus at L_h, the limit of the operating range of the state decoder system. During this plateau, this state has the largest log likelihood in the system. Figure 5c is an expanded version of Figure 5b, showing in detail the renormalization cycles. Figure 5d shows the output computed by the \"Word Detect\" block in Figure 4. Note the smoothness of the waveform, unlike Figure 5c. By subtracting the filler-state log likelihood from the end-state log likelihood, the Word Detect block cancels the common-mode renormalization waveform. \n\nFigure 6 shows a series of four experiments that confirm the qualitative behavior of the state decoder system. This figure shows experimental data recorded from the fabricated test chip. Each experiment consists of playing a particular pattern of input probability currents P_1 ... P_10 to the state decoder many times; for each repetition, a certain aspect of the playback is systematically varied. We measure the peak value of the end-state log likelihood during each repetition, and plot this value as a function of the varied input parameter. 
For each experiment shown in Figure 6, the left plot describes the input pattern, while the right plot is the measured end-state log likelihood data. The experiment shown in Figure 6a involves presenting complete word patterns of varying durations to the decoder. As expected, words with unrealistically short durations have end-state responses below L_h, and would not produce successful word detection. \n\nFigure 5. Simulation of state decoder: (a) Input patterns, (b), (c) End-state response, (d) Word-detection response. \n\nFigure 6. Measured chip data for end-state likelihoods for long, short, and incomplete pattern sequences. \n\nThe experiment shown in Figure 6b also involves presenting patterns of varying durations to the decoder, but the word patterns are presented \"backwards,\" with input current P_10 peaking first, and input current P_1 peaking last. The end-state response never reaches L_h, even at long word durations, and (correctly) would not trigger a word detection. 
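The forward versus backward comparison in Figures 6a and 6b can be illustrated with a discrete-time software sketch. It assumes the update phi_i(n) = b_i(n) (phi_i(n-1) + phi_{i-1}(n-1)), inferred from the earlier description of g_i(n-1), and evaluates it in the log domain. The helper names, the probability levels 0.9 and 0.01, and the initial condition are hypothetical. The continuous-time delay and the renormalization are not modeled, so the sketch does not reproduce the duration dependence seen in Figure 6a.

```python
import math

NEG_INF = -math.inf


def log_add(a, b):
    '''Return log(exp(a) + exp(b)) without overflow; accepts -inf.'''
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def make_pattern(n_states, frames_per_state, backwards=False, high=0.9, low=0.01):
    '''One row per frame; exactly one state has probability high in each frame.'''
    rows = []
    for n in range(n_states * frames_per_state):
        active = n // frames_per_state
        if backwards:
            active = n_states - 1 - active
        rows.append([high if i == active else low for i in range(n_states)])
    return rows


def end_state_log_likelihood(pattern):
    '''Run the assumed recursion and return log(phi) of the last state.'''
    n_states = len(pattern[0])
    # Hypothetical start: all likelihood sits in the first state.
    log_phi = [0.0] + [NEG_INF] * (n_states - 1)
    for frame in pattern:
        prev = log_phi
        log_phi = []
        for i, b in enumerate(frame):
            from_left = prev[i - 1] if i > 0 else NEG_INF
            log_phi.append(math.log(b) + log_add(prev[i], from_left))
    return log_phi[-1]


forward = end_state_log_likelihood(make_pattern(10, 3))
backward = end_state_log_likelihood(make_pattern(10, 3, backwards=True))
print(forward, backward)
```

With these settings the forward pattern yields a much larger end-state log likelihood than the reversed one, which is the qualitative ordering the measurements show.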
\n\nThe experiments shown in Figure 6c and 6d involve presenting partially complete word patterns to the decoder. In both experiments, the duration of the complete word pattern is 250 ms. Figure 6c shows words with truncated endings, while Figure 6d shows words with truncated beginnings. In Figure 6c, end-state log likelihood is plotted as a function of the last excited state in the pattern; in Figure 6d, end-state log likelihood is plotted as a function of the first excited state in the pattern. In both plots the end-state log likelihood falls below L_h as significant information is removed from the word pattern. \n\nWhile performing the experiments shown in Figure 6, the state-decoder and word-detection sections of the chip had a measured average power consumption of 141 nW (V_dd = 5 V). More generally, however, the power consumption, input probability range, and the number of states are related parameters in the state decoder system. \n\nAcknowledgments \n\nWe thank Herve Bourlard, Dan Hammerstrom, Brian Kingsbury, Alan Kramer, Nelson Morgan, Stylianos Perissakis, Su-lin Wu, and the anonymous reviewers for comments on this work. Sponsored by the Office of Naval Research (URI-N00014-92-J-1672) and the Department of Defense Advanced Research Projects Agency. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Air Force. \n\nReference \n\nLippmann, R. P., Chang, E. I., and Jankowski, C. R. (1994). \"Wordspotter training using figure-of-merit back-propagation,\" Proceedings International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 389-392.", "award": [], "sourceid": 1206, "authors": [{"given_name": "John", "family_name": "Lazzaro", "institution": null}, {"given_name": "John", "family_name": "Wawrzynek", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}