{"title": "Figure of Merit Training for Detection and Spotting", "book": "Advances in Neural Information Processing Systems", "page_first": 1019, "page_last": 1026, "abstract": null, "full_text": "Figure of Merit Training for Detection and Spotting \n\nEric I. Chang and Richard P. Lippmann \n\nMIT Lincoln Laboratory \n\nLexington, MA 02173-0073, USA \n\nAbstract \n\nSpotting tasks require detection of target patterns from a background of \nrichly varied non-target inputs. The performance measure of interest for \nthese tasks, called the figure of merit (FOM), is the detection rate for \ntarget patterns when the false alarm rate is in an acceptable range. A \nnew approach to training spotters is presented which computes the FOM \ngradient for each input pattern and then directly maximizes the FOM \nusing backpropagation. This eliminates the need for thresholds during \ntraining. It also uses network resources to model Bayesian a posteriori \nprobability functions accurately only for patterns which have a \nsignificant effect on the detection accuracy over the false alarm rate of \ninterest. FOM training increased detection accuracy by 5 percentage \npoints for a hybrid radial basis function (RBF) - hidden Markov model \n(HMM) wordspotter on the credit-card speech corpus. \n\n1  INTRODUCTION \nSpotting tasks require accurate detection of target patterns from a background of richly varied \nnon-target inputs. Examples include keyword spotting from continuous acoustic input, \nspotting cars in satellite images, detecting faults in complex systems over a wide range of \noperating conditions, detecting earthquakes from continuous seismic signals, and finding \nprinted text on images which contain complex graphics. These problems share three common \ncharacteristics. First, the number of instances of target patterns is unknown. 
Second, \npatterns from background, non-target classes are varied and often difficult to model accurately. \nThird, the performance measure of interest, called the figure of merit (FOM), is the \ndetection rate for target patterns when the false alarm rate is over a specified range. \n\nFigure 1. Block diagram of a spotting system (blocks, from input to output: input pattern, classifier, normalization and thresholding, putative hits). \n\nNeural network classifiers are often used for detection problems by training on target and \nbackground classes, optionally normalizing target outputs using the background output, \nand thresholding the resulting score to generate putative hits, as shown in Figure 1. Putative \nhits in this figure are input patterns which generate normalized scores above a threshold. \nWe have developed a hybrid radial basis function (RBF) - hidden Markov model (HMM) \nkeyword spotter. This wordspotter was evaluated using the NIST credit card speech database \nas in (Rohlicek, 1993; Zeppenfeld, 1993) using the same train/evaluation split of the \ntraining conversations as was used in (Zeppenfeld, 1993). The system spots 20 target keywords, \nincludes one general filler class, and uses a Viterbi decoding backtrace as described \nin (Lippmann, 1993) to backpropagate errors over a sequence of input speech frames. The \nperformance of this spotting system and its improved versions is analyzed by plotting detection \nversus false alarm rate curves as shown in Figure 2. These curves are generated by \nadjusting the classifier output threshold to allow few or many putative hits. 
Wordspotter putative \nhits used to generate Figure 2 correspond to speech frames when the difference between \nthe cumulative log Viterbi scores in output HMM nodes of word and filler models is \nabove a threshold. The FOM for this wordspotter is defined as the average keyword detection \nrate when the false alarm rate ranges from 1 to 10 false alarms per keyword per hour. \nThe 69.7% figure of merit for this system means that 69.7% of keyword occurrences are \ndetected on average while generating from 20 to 200 false alarms per hour of input \nspeech. \n\n2  PROBLEMS WITH BACKPROPAGATION TRAINING \nNeural network classifiers used for spotting tasks can be trained using conventional backpropagation \nprocedures with 1 of N desired outputs and a squared error cost function. This \napproach to training does not maximize the FOM because it attempts to estimate Bayesian \na posteriori probability functions accurately for all inputs even if a particular input has little \neffect on detection accuracy at false alarm rates of interest. Excessive network resources \nmay be allocated to modeling the distribution of common background inputs dissimilar \nfrom targets and of high-scoring target inputs which are easily detected. This problem can \nbe addressed by training only when network outputs are above thresholds. This approach is \nproblematic because it is difficult to set the threshold for different keywords, because using \nfixed target values of 1.0 and 0.0 requires careful normalization of network output scores to \nprevent saturation and maintain backpropagation effectiveness, and because the gradient \ncalculated from a fixed target value does not reflect the actual impact on the FOM. \n\nFigure 2. Detection vs. false alarm rate curve for a 20-word hybrid wordspotter on a split of the credit-card training data (x-axis: false alarms per keyword per hour, 0 to 10; y-axis: % correct detections; curves: FOM back-prop, FOM 69.7%; embedded reestimation, FOM 64.5%; isolated word train, FOM 62.5%). \n\nFigure 3 shows the gradient of true hits and false alarms when target values are set to be 1.0 \nfor true hits and 0.0 for false alarms, the output unit is sigmoidal, and the threshold for a \nputative hit is set to roughly 0.6. The gradient is the derivative of the squared error cost with \nrespect to the input of the sigmoidal output unit. As can be seen, low-scoring hits or false \nalarms that may affect the FOM are ignored, the gradient is discontinuous at the threshold, \nthe gradient does not fall to zero fast enough at high values, and the relative sizes of the hit \nand false alarm gradients do not reflect the true effect of a hit or false alarm on the FOM. \n\nFigure 3. The gradient for a sigmoid output unit when the target value for true hits is set to \n1.0 and the target value for false alarms is set to 0.0 (x-axis: output value, 0 to 0.9; y-axis: gradient, -0.2 to 0.2; the threshold and the hit gradient are marked). \n\n3  FIGURE OF MERIT TRAINING \nA new approach to training a spotter system called \"figure of merit training\" is to directly \ncompute the FOM and its derivative. This derivative is the change in FOM over the change \nin the output score of a putative hit and can be used instead of the derivative of a squared-error \nor other cost function during training. Since the FOM is calculated by sorting true hits \nand false alarms separately for each target class and forming detection versus false alarm \ncurves, these measures and their derivatives cannot be computed analytically. Instead, the \nFOM and its derivative are computed using fast sort routines. These routines insert a new \nputative hit into an already sorted list and calculate the change in the FOM caused by that \ninsertion. The running putative hit list used to compute the FOM is updated after every new \nputative hit is observed and it must contain all putative hits observed during the most recent \ntraining cycle through all training patterns. The gradient estimate is smoothed over \nnearby putative hit scores to account for the quantized nature of detection versus false alarm \nrate curves. \n\nFigure 4 shows plots of linearly scaled gradients for the 20-word hybrid wordspotter. Each \nvalue on the curve represents the smoothed change in the FOM that occurs when a single \nhit or false alarm with the specified normalized log output score is inserted into the current \nputative hit list. Gradients are positive for putative hits corresponding to true hits and negative \nfor false alarms. They also fall off to zero for putative hits with extremely high or low \nscores. Shapes of these curves vary across words. The relative importance of a hit or false \nalarm, the normalized output score which results in high gradient values, and the shape of \nthe gradient curve all vary. Use of a squared error or other cost function with sigmoid output \nnodes would not generate this variety of gradients or automatically identify the range of putative \nhit scores where gradients should be high. 
Application of FOM training requires only \nthe gradients shown in these curves with no supplementary thresholds. Patterns with very low \nor very high scores have a minimal effect during training without using thresholds because \nthey produce gradients near zero. \n\nDifferent keywords have dramatically different gradients. For example, credit-card is long \nand the detection rate is high. The overall FOM thus doesn't change much if more true hits \nare found. A high scoring false alarm, however, decreases the FOM drastically. There is thus \na large negative gradient for false alarms for credit-card. The keywords account and check \nare usually short in duration and thus more difficult to detect, so any increase in the number \nof true hits strongly increases the overall FOM. On the other hand, since the words account \nand check occur much less frequently than credit-card in this database, a high scoring \nfalse alarm for the words account and check has less impact on the overall FOM. The gradient \nfor false alarms for these words is thus correspondingly smaller. Comparing the \ncurves in Figure 4 with the fixed prototypical curve in Figure 3 demonstrates the dramatic \ndifferences in gradients that occur when the gradient is calculated to maximize the FOM \ndirectly instead of using a threshold with sigmoid output nodes. \n\nFigure 4. Figure of merit gradients computed for true hits (HIT) and false alarms (FA) \nwith scores ranging from -100 to 300 for the keywords account, check, and \ncredit-card (x-axis: putative hit score; y-axis: gradient). 
\n\nFOM training is a general technique that can be applied to any \"spotting\" task where a set of \ntargets must be discriminated from background inputs. FOM training was successfully tested \nusing the hybrid radial basis function (RBF) - hidden Markov model (HMM) keyword \nspotter described in (Lippmann, 1993). \n\n4  IMPLEMENTATION OF FOM TRAINING \nFOM training is applied to our high-performance HMM wordspotter after forward-backward \ntraining is complete. Word models in the HMM wordspotter are first used to spot on \ntraining conversations. The FOM gradient of each putative hit is calculated when this hit is \ninserted into the putative hit list. The speech segment corresponding to a putative hit is excised \nfrom the conversation speech file and the corresponding keyword model is used to \nmatch each frame with a particular state in the model using a Viterbi backtrace (shown in \nFigure 5). The gradient is then used to adjust the location of each Gaussian component in a \nnode as in RBF classifiers (Lippmann, 1993) and also the state weight of each state. The \nstate weight is a penalty added for each frame assigned to a state. The weight for each individual \nstate is adjusted according to how important each state is to the detection of the \nkeyword. For example, many false alarms for the word card are words that sound like part \nof the keyword such as hard or far. The first few states of the card model represent the sound \n/k/ and false alarms stay in these front states only a short time. If the state weight of the first \nfew states of the card model is large, then a true hit has a larger score than false alarms. \n\nFigure 5. State weights and center updates are applied to the state that is matched to each \nframe in a Viterbi backtrace (diagram labels: raw keyword score, Viterbi alignment, radial basis function nodes). \n\nThe putative hit score which is used to detect peaks representing putative hits is generated \naccording to \n\n$$S_{total} = S_{keyword} - S_{filler}$$ (EQ 1) \n\nIn this equation, $S_{total}$ is the putative hit score, $S_{keyword}$ is the log Viterbi score in the \nlast node of a specific keyword model computed using the Viterbi algorithm from the beginning \nof the conversation to the frame where the putative hit ended, and $S_{filler}$ is the log \nViterbi score in the last node of the filler model. The filler score is used to normalize the \nkeyword score and approximate a posterior probability. The keyword score is calculated using \na modified form of the Viterbi algorithm \n\n$$\\alpha_i(t+1) = \\max(\\alpha_i(t) + a_{i,i}, \\alpha_{i-1}(t) + a_{i-1,i}) + d_i(t, x) + w_i$$ (EQ 2) \n\nThis equation is identical to the normal Viterbi recursion for left-to-right linear word models \nafter initialization, except the extra state score $w_i$ is added. In this equation, $\\alpha_i(t)$ is \nthe log Viterbi score in node $i$ at time $t$, $a_{i,j}$ is the log of the transition probability from \nnode $i$ to node $j$, and $d_i(t, x)$ is the log likelihood distance score for node $i$ for the input \nfeature vector $x$ at time $t$. \n\nWord scores are computed and a peak-picking algorithm looks for maxima above a low \nthreshold. After a peak representing a putative hit is detected, frames of a putative hit are \naligned with the states in the keyword model using the Viterbi backtrace and both the means \nof Gaussians in each state and state weights of the keyword model are modified. 
State \nweights are modified according to \n\n$$w_i(t+1) = w_i(t) + \\mathrm{gradient} \\times \\eta_{state} \\times \\mathrm{duration}$$ (EQ 3) \n\nIn this equation, $w_i(t)$ is the state weight in node $i$ at time $t$, gradient is the FOM \ngradient for the putative hit, $\\eta_{state}$ is the stepsize for state weight adaptation, and \nduration is the number of frames aligned to node $i$. If a true hit occurs, and the gradient \nis positive, the state weight is increased in proportion to the number of frames assigned to \na state. If a false alarm occurs, the state weight is reduced in proportion to the number of \nframes assigned to a state. The state weight will thus be strongly positive if there are many \nmore frames for a true hit than for a false alarm. It will be strongly negative if there are more \nframes for a false alarm than for a true hit. High state weight values should thus improve \ndiscrimination between true hits and false alarms. \n\nThe centers of the Gaussian components within each node, which are similar to Gaussians \nin radial basis function networks, are modified according to \n\n$$m_{ij}(t+1) = m_{ij}(t) + \\mathrm{gradient} \\times \\eta_{center} \\times \\frac{x_j(t) - m_{ij}(t)}{\\sigma_{ij}}$$ (EQ 4) \n\nIn this equation, $m_{ij}(t)$ is the $j$th component of the mean vector for a Gaussian hidden \nnode in HMM state $i$ at time $t$, gradient is the FOM gradient, $\\eta_{center}$ is the stepsize \nfor moving Gaussian centers, $x_j(t)$ is the value of the $j$th component of the input feature \nvector at time $t$, and $\\sigma_{ij}$ is the standard deviation of the $j$th component of the Gaussian \nhidden node in HMM state $i$. \n\nFor each true hit, the centers of Gaussian hidden nodes in a state move toward the observation \nvectors of frames assigned to a particular state. For a false alarm, the centers move \naway from the observation vectors that are assigned to a particular state. 
Over time, the centers \nmove closer to the true hit observation vectors and further away from false alarm observation \nvectors. \n\nFigure 6. Change in FOM vs. the number of conversations that the models have been \ntrained with. There were 25 male training conversations and 23 female training \nconversations (x-axis: number of conversations, 0 to 160; y-axis: FOM, 0.5 to 0.95; labeled curves include FEMALE TRAIN and MALE TEST). \n\n5  EXPERIMENTAL RESULTS \nExperiments were performed using an HMM wordspotter that was trained using a maximum \nlikelihood algorithm. More complicated models were created for words which occur frequently \nin the training set. The word models for card and credit-card were increased to four \nmixtures per state. The models for cash, charge, check, credit, dollar, interest, money, \nmonth, and visa were increased to two mixtures per state. All other word models had one \nmixture per state. The number of states per keyword is roughly 1.5 times the number of phonemes \nin each keyword. Covariance matrices were diagonal and variances were estimated \nseparately for all states. All systems were trained on the first 50 talkers in the credit card \ntraining corpus and evaluated using the last 20 talkers. \n\nAn initial set of models was trained during 16 passes through the training data using whole-word \ntraining and Viterbi alignment on only the excised words from the training conversations. \nThis training provided a FOM of 62.5% on the 20 evaluation talkers. 
Embedded forward-backward \nreestimation training was then performed where models of keywords and \nfillers are linked together and trained jointly on conversations which were split up into sentence-length \nfragments. This second stage of HMM training increased the FOM by two percentage \npoints to 64.5%. The detection rate curves of these systems are shown in Figure 2. \n\nFOM training was then performed for six passes through the training data. On each pass, \nconversations were presented in a new random order. The change in FOM for the training \nset and the evaluation set is shown in Figure 6. The FOM on the training data for both male \nand female talkers increased by more than 10 percentage points after roughly 50 conversations \nhad been presented. The FOM on the evaluation data increased by 5.2 percentage \npoints to 69.7% after three passes through the training data, but then decreased with further \ntraining. This result suggests that the extra structure learned during the final three training \npasses is overfitting the training data and providing poor performance on the evaluation set. \n\nFigure 7 shows the spectrograms of high scoring true hits and false alarms for the word card \ngenerated by our wordspotter. All false alarms shown are actually occurrences of the \nword car. The spectrograms of the true hits and the false alarms are very similar and the \nactual excised speech segments are difficult even for humans to distinguish. \n\nFigure 7. Spectrograms of high scoring true hit and false alarm for the word card (panel A: true hits for card). \n\n6  SUMMARY \nDetection of target signals embedded in a noisy background is a common and difficult problem \ndistinct from the task of classification. 
The evaluation metric of a spotting system, \ncalled Figure of Merit (FOM), is also different from the classification accuracy used to evaluate \nclassification systems. FOM training uses a gradient which directly reflects a putative \nhit's impact on the FOM to modify the parameters of the spotting system. FOM training \ndoes not require careful adjustment of thresholds and target values and has been applied to \nimprove a wordspotter's FOM from 64.5% to 69.7% on the credit card database. FOM \ntraining can also be applied to other spotting tasks such as arrhythmia detection and address \nblock location. \n\nACKNOWLEDGEMENT \nThis work was sponsored by the Advanced Research Projects Agency. The views expressed are those \nof the authors and do not reflect the official policy or position of the U.S. Government. Portions of \nthis work used the HTK Toolkit developed by Dr. Steve Young of Cambridge University. \n\nBIBLIOGRAPHY \nR. Lippmann & E. Singer. (1993) Hybrid HMM/Neural-Network Approaches to Wordspotting. In ICASSP '93, \nvolume I, pages 565-568. \n\nJ. Rohlicek et al. (1993) Phonetic and Language Modeling for Wordspotting. In ICASSP '93, volume \nII, pages 459-462. \n\nT. Zeppenfeld, R. Houghton & A. Waibel. (1993) Improving the MS-TDNN for Word Spotting. In \nICASSP '93, volume II, pages 475-478. \n", "award": [], "sourceid": 792, "authors": [{"given_name": "Eric", "family_name": "Chang", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}