{"title": "Dual Kalman Filtering Methods for Nonlinear Prediction, Smoothing and Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 799, "abstract": null, "full_text": "Dual Kalman Filtering Methods for \nNonlinear Prediction,  Smoothing,  and \n\nEstimation \n\nEric A.  Wan \n\nericwan@ee.ogi.edu \n\nAlex T.  Nelson \natnelson@ee.ogi.edu \n\nDepartment of Electrical  Engineering \n\nOregon  Graduate Institute \n\nP.O. Box 91000  Portland, OR 97291 \n\nAbstract \n\nPrediction,  estimation,  and smoothing  are  fundamental  to  signal \nprocessing.  To perform  these  interrelated  tasks  given  noisy  data, \nwe  form  a  time  series  model  of  the  process  that  generates  the \ndata.  Taking noise in the system explicitly into account, maximum(cid:173)\nlikelihood and Kalman frameworks are discussed  which involve the \ndual process  of estimating both the model parameters and the un(cid:173)\nderlying state of the system.  We  review  several  established  meth(cid:173)\nods in the linear case,  and propose severa! extensions utilizing dual \nKalman filters  (DKF)  and forward-backward  (FB)  filters  that  are \napplicable  to  neural  networks.  Methods  are  compared on  several \nsimulations of noisy  time series.  We  also  include  an  example  of \nnonlinear noise reduction in speech. \n\n1 \n\nINTRODUCTION \n\nConsider the general  autoregressive  model of a noisy  time series  with  both process \nand additive observation noise: \n\nx(k) \ny(k) \n\nI(x(k - 1), ... x(k - M), w) + v(k - 1) \nx(k) + r(k), \n\n(1) \n(2) \n\nwhere  x(k)  corresponds  to the  true underlying  time series  driven  by  process  noise \nv(k),  and 10 is  a  nonlinear  function  of past  values  of x(k)  parameterized  by  w. \n\n\f794 \n\nE. A.  Wan and A. T.  Nelson \n\nThe only available observation is y(k)  which contains additional additive noise r(k) . \nPrediction  refers  to  estimating an  x(k)  given  past  observations.  (For  purposes  of \nthis  paper  we  will  restrict  ourselves  to univariate time series.)  In  estimation,  x(k) \nis  determined  given  observations  up  to  and  including  time  k.  Finally,  smoothing \nrefers  to estimating x(k)  given all observations,  past and future. \nThe minimum mean square  nonlinear prediction  of x(k)  (or  of y(k))  can  be  writ(cid:173)\nten  as  the  conditional  expectation  E[x(k)lx(k - 1)],  where  x(k)  =  [x(k), x(k -\n1),\u00b7 .. x(O)] .  If the  time series  x(k)  were  directly  available,  we  could  use  this data \nto generate an approximation of the optimal predictor.  However,  when  x(k)  is  not \navailable (as is  generally  the case),  the common approach  is  to  use  the noisy  data \ndirectly,  leading to an approximation of E[y(k)ly(k -1)] . However,  this results in a \nbiased predictor:  E[y(k)ly(k-l)] =  E[x(k)lx(k -1) +R(k -1)] i=  E[x(k)lx(k-l)]. \nWe  may  reduce  the  above  bias  in  the predictor  by  exploiting the  knowledge  that \nthe observations y(k)  are measurements arising from a  time series.  Estimates x(k) \nare  found  (either  through  estimation  or  smoothing)  such  that  Ilx(k) - x(k)11  < \nII x (k ) - y( k) II.  These estimates are then used to form a predictor that approximates \nE[x(k)lx(k - 1)].1 \n\nIn the remainder of this paper,  we  will develop  methods for  the dual estimation of \nboth states x and weights Vi.  
2 DUAL ESTIMATION \n\nGiven only noisy observations y(k), the dual estimation problem requires consideration of both the standard prediction (or output) errors e_p(k) = y(k) - f(x̂(k-1), w) and the observation (or input) errors e_q(k) = y(k) - x̂(k). The minimum observation error variance equals the noise variance σ_r^2. The prediction error, however, is correlated with the observation error, since y(k) - f(x(k-1)) = v(k-1) + r(k), and thus has a minimum variance of σ_r^2 + σ_v^2. Assuming the errors are Gaussian, we may construct a log-likelihood function which is proportional to e^T Σ^{-1} e, where e^T = [e_q(0), e_q(1), ..., e_q(N), e_p(M), e_p(M+1), ..., e_p(N)] is a vector of all errors up to time N, and Σ is their covariance matrix, with block form \n\nΣ = [ σ_r^2 I          Σ_qp ] \n    [ Σ_qp^T   (σ_r^2 + σ_v^2) I ],    (3) \n\nwhere the off-diagonal blocks Σ_qp contain the cross-correlation terms (e.g., E[e_q(k) e_p(k)] = σ_r^2). Minimization of the log-likelihood function leads to the maximum-likelihood estimates for both x(k) and w. (Although we may also estimate the noise variances σ_v^2 and σ_r^2, we will assume in this paper that they are known.) Two general frameworks for optimization are available: \n\n2.1 Errors-In-Variables (EIV) Methods \n\nThis method comes from the statistics literature for nonlinear regression (see Seber and Wild, 1989), and involves batch optimization of the cost function in Equation 3. Only minor modifications are made to account for the time series model. These methods, however, are memory intensive (Σ is approximately 2N × 2N) and do not accommodate new data in an efficient manner: retraining on all the data is necessary to produce estimates for the new data points. \n\nIf we ignore the cross-correlation between the prediction and observation errors, then Σ becomes a diagonal matrix and the cost function may be expressed as simply ∑_{k=1}^{N} γ e_p^2(k) + e_q^2(k), with γ = σ_r^2/(σ_r^2 + σ_v^2). This is equivalent to the clearning (CLRN) cost function (Weigend, 1995), developed as a heuristic method for cleaning the inputs in neural network modelling problems. While this allows for stochastic optimization, the assumption in the time series formulation may lead to severely biased results. Note also that no estimate is provided for the last point x(N). \n\nWhen the model f = w^T x is known and linear, EIV reduces to a standard (batch) weighted least squares procedure which can be solved in closed form to generate a maximum-likelihood estimate of the noise-free time series. However, when the linear model is unknown, the problem is far more complicated. The inner product of the parameter vector w with the vector x(k-1) indicates a bilinear relationship between these unknown quantities: solving for x̂(k) requires knowledge of w, while solving for ŵ requires x̂(k). Iterative methods are necessary to solve the nonlinear optimization, and a Newton's-type batch method is typically employed. An EIV method for nonlinear models is also readily developed, but the computational expense makes it less practical in the context of neural networks. \n\n
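For illustration, the decoupled (diagonal-covariance) cost above can be evaluated directly. The sketch below assumes the linear model f = w^T x̂; the function name and argument conventions are ours, not the paper's, and a batch EIV method would optimize both x̂ and w jointly over this cost rather than merely evaluate it. \n\n    # Decoupled (diagonal-covariance) cost of Section 2.1 for a linear model. \n    import numpy as np \n\n    def clrn_cost(y, x_hat, w, sigma_v2, sigma_r2): \n        # Returns sum_k gamma * e_p(k)^2 + e_q(k)^2 for candidate estimates. \n        M = len(w) \n        gamma = sigma_r2 / (sigma_r2 + sigma_v2) \n        e_q = y - x_hat                       # observation (input) errors \n        e_p = np.array([y[k] - w @ x_hat[k-M:k][::-1]   # prediction errors \n                        for k in range(M, len(y))]) \n        return gamma * np.sum(e_p**2) + np.sum(e_q**2) \n\n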
2.2 Kalman Methods \n\nKalman methods involve reformulation of the problem into a state-space framework in order to efficiently optimize the cost function in a recursive manner. At each time point, an optimal estimate is achieved by combining both a prior prediction and a new observation. Connor (1994) proposed using an extended Kalman filter with a neural network to perform state estimation alone. Puskorius and Feldkamp (1994) and others have posed the weight estimation in a state-space framework to allow Kalman training of a neural network. Here we extend these ideas to include the dual Kalman estimation of both states and weights for efficient maximum-likelihood optimization. We also introduce the use of forward-backward information filters and further explicate relationships to the EIV methods. \n\nA state-space formulation of Equations 1 and 2 is as follows: \n\nx(k) = F[x(k-1)] + B v(k-1)    (4) \ny(k) = C x(k) + r(k)    (5) \n\nwhere \n\nx(k) = [x(k), x(k-1), ..., x(k-M+1)]^T,  F[x(k)] = [f(x(k), ..., x(k-M+1), w), x(k), ..., x(k-M+2)]^T,  B = [1, 0, ..., 0]^T,    (6) \n\nand C = B^T. If the model is linear, then f(x(k)) takes the form w^T x(k), and F[x(k)] can be written as A x(k), where A is in controllable canonical form. \n\nIf the model is linear and the parameters w are known, the Kalman filter (KF) algorithm can be readily used to estimate the states (see Lewis, 1986). At each time step, the filter computes the linear least squares estimate x̂(k) and prediction x̂⁻(k), as well as their error covariances, P_x(k) and P_x⁻(k). In the linear case with Gaussian statistics, the estimates are the minimum mean square estimates. With no prior information on x, they reduce to the maximum-likelihood estimates. \n\nNote, however, that while the Kalman filter provides the maximum-likelihood estimate at each instant in time given all past data, the EIV approach is a batch method that gives a smoothed estimate given all data. Hence, only the estimates x̂(N) at the final time step will match. An exact equivalence for all time is achieved by combining the Kalman filter with a backwards information filter to produce a forward-backward (FB) smoothing filter (Lewis, 1986).[2] Effectively, an inverse covariance is propagated backwards in time to form backwards state estimates that are combined with the forward estimates. When the data set is large, the FB filter offers significant computational advantages over the batch form. \n\n[2] A slight modification of the cost in Equation 3 is necessary to account for initial conditions in the Kalman form. \n\nWhen the model is nonlinear, the Kalman filter cannot be applied directly, but requires a linearization of the nonlinear model at each time step. The resulting algorithm is known as the extended Kalman filter (EKF) and effectively approximates the nonlinear function with a time-varying linear one. \n\n
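For the linear, known-model case, the KF recursion just described can be sketched in a few lines. This is a generic textbook implementation of the filter for Equations 4 and 5, not code from the paper; the variable names are ours. \n\n    # Kalman filter for x(k) = A x(k-1) + B v(k-1), y(k) = C x(k) + r(k), \n    # with known linear weights w. Returns the filtered estimates of x(k). \n    import numpy as np \n\n    def kalman_filter(y, w, sigma_v2, sigma_r2): \n        M = len(w) \n        A = np.vstack([w, np.eye(M)[:-1]])   # controllable canonical form \n        B = np.zeros((M, 1)); B[0, 0] = 1.0 \n        C = B.T \n        x_hat = np.zeros((M, 1)) \n        P = np.eye(M) \n        estimates = [] \n        for yk in y: \n            # time update (prediction) \n            x_prior = A @ x_hat \n            P_prior = A @ P @ A.T + B @ B.T * sigma_v2 \n            # measurement update \n            K = P_prior @ C.T / (C @ P_prior @ C.T + sigma_r2) \n            x_hat = x_prior + K * (yk - C @ x_prior) \n            P = (np.eye(M) - K @ C) @ P_prior \n            estimates.append(x_hat[0, 0])    # estimate of x(k) \n        return np.array(estimates) \n\n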
2.2.1 Batch Iteration for Unknown Models \n\nAgain, when the linear model is unknown, the bilinear relationship between the time series estimates, x̂, and the weight estimates, ŵ, requires an iterative optimization. One approach (referred to as LS-KF) is to use a Kalman filter to estimate x̂(k) with ŵ fixed, followed by least-squares optimization to find ŵ using the current x̂(k). Specifically, the parameters are estimated as ŵ = (X_KF^T X_KF)^{-1} X_KF^T Y, where X_KF is a matrix of KF state estimates, and Y is an N × 1 vector of observations. \n\nFor nonlinear models, we use a feedforward neural network to approximate f(.), and replace the LS and KF procedures by backpropagation and extended Kalman filtering, respectively (referred to here as BP-EKF; see Connor, 1994). A disadvantage of this approach is slow convergence, due to keeping a set of inaccurate estimates fixed at each batch optimization stage. \n\n2.2.2 Dual Kalman Filter \n\nAnother approach for unknown models is to concatenate both w and x into a joint state vector. The model and time series are then estimated simultaneously by applying an EKF to the nonlinear joint state equations (see Goodwin and Sin, 1994 for the linear case). This algorithm, however, has been known to have convergence problems. \n\nAn alternative is to construct a separate state-space formulation for the underlying weights as follows: \n\nw(k) = w(k-1)    (7) \ny(k) = f(x̂(k-1), w(k)) + n(k),    (8) \n\nwhere the state transition is simply an identity matrix, and f(x̂(k-1), w(k)) plays the role of a time-varying nonlinear observation on w. \n\nWhen the unknown model is linear, the observation takes the form x̂(k-1)^T w(k). Then a pair of dual Kalman filters (DKF) can be run in parallel, one for state estimation and one for weight estimation (see Nelson, 1976). At each time step, all current estimates are used. The dual approach essentially allows us to separate the nonlinear optimization into two linear ones. The assumptions are that x and w remain uncorrelated and that the statistics remain Gaussian. Note, however, that the error in each filter should be accounted for by the other. We have developed several approaches to address this coupling, but only present one here for the sake of brevity. In short, we write the variance of the noise n(k) as C P_x(k) C^T + σ_r^2 in Equation 8, and replace v(k-1) by v(k-1) + (ŵ(k) - w(k))^T x̂(k-1) in Equation 4 for estimation of x(k). Note that the ability to couple statistics in this manner is not possible in the batch approaches. \n\nWe further extend the DKF method to nonlinear neural network models by introducing a dual extended Kalman filtering method (DEKF). This simply requires that Jacobians of the neural network be computed for both filters at each time step. Note that by feeding x̂(k) into the network, we are implicitly using a recurrent network. \n\n
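A minimal sketch of the DKF recursion for the linear case follows. For brevity it omits the noise-coupling terms just described (the C P_x(k) C^T contribution to the variance of n(k) and the augmented process noise), so it is a simplified illustration under those assumptions rather than the full method. \n\n    # Dual Kalman filtering (DKF) for an unknown linear model: two filters \n    # run in parallel, one estimating the state x, one the weights w. \n    import numpy as np \n\n    def dkf(y, M, sigma_v2, sigma_r2): \n        w_hat = np.zeros(M); P_w = np.eye(M)        # weight filter \n        x_hat = np.zeros(M); P_x = np.eye(M)        # state filter \n        B = np.zeros((M, 1)); B[0, 0] = 1.0; C = B.T \n        for yk in y[M:]: \n            # weight filter: observation y(k) = x_hat(k-1)^T w(k) + n(k) \n            h = x_hat.copy()                        # regressor of past estimates \n            K_w = P_w @ h / (h @ P_w @ h + sigma_r2) \n            w_hat = w_hat + K_w * (yk - h @ w_hat) \n            P_w = P_w - np.outer(K_w, h) @ P_w \n            # state filter: uses the current weight estimate in A \n            A = np.vstack([w_hat, np.eye(M)[:-1]]) \n            x_prior = A @ x_hat \n            P_prior = A @ P_x @ A.T + B @ B.T * sigma_v2 \n            K_x = P_prior @ C.T / (C @ P_prior @ C.T + sigma_r2) \n            x_hat = (x_prior[:, None] + K_x * (yk - C @ x_prior[:, None])).ravel() \n            P_x = (np.eye(M) - K_x @ C) @ P_prior \n        return w_hat, x_hat \n\n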
2.2.3 Forward-Backward Methods \n\nAll of the Kalman methods can be reformulated by using forward-backward (FB) Kalman filtering to further improve state smoothing. However, the dual Kalman methods require an interleaving of the forward and backward state estimates in order to generate a smooth update at each time step. In addition, using the FB estimates requires caution because their noncausal nature can lead to a biased ŵ if they are used improperly. Specifically, for LS-FB the weights are computed as ŵ = (X_KF^T X_FB)^{-1} X_KF^T Y, where X_FB is a matrix of FB (smoothed) state estimates. Equivalent adjustments are made to the dual Kalman methods. Furthermore, a model of the time-reversed system is required for the nonlinear case. The explication and results of these algorithms will appear in a future publication. \n\n
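The forward-backward combination can be illustrated as follows. The paper propagates a backwards information filter; the sketch below instead uses the algebraically equivalent Rauch-Tung-Striebel backward pass over stored forward-filter quantities, which is an implementation choice on our part, not the authors' exact formulation. \n\n    # Fixed-interval smoothing for the linear model via an RTS backward pass. \n    # Inputs: filtered estimates/covariances and one-step predictions from the \n    # forward Kalman pass, plus the transition matrix A. \n    import numpy as np \n\n    def rts_smoother(x_filt, P_filt, x_pred, P_pred, A): \n        N = len(x_filt) \n        x_smooth = [None] * N; x_smooth[-1] = x_filt[-1] \n        P_smooth = [None] * N; P_smooth[-1] = P_filt[-1] \n        for k in range(N - 2, -1, -1): \n            G = P_filt[k] @ A.T @ np.linalg.inv(P_pred[k + 1])   # smoother gain \n            x_smooth[k] = x_filt[k] + G @ (x_smooth[k + 1] - x_pred[k + 1]) \n            P_smooth[k] = P_filt[k] + G @ (P_smooth[k + 1] - P_pred[k + 1]) @ G.T \n        return x_smooth, P_smooth \n\n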
3 EXPERIMENTS \n\nTable 1 compares the different approaches on two linear time series, both when the linear model is known and when it is unknown. The least squares (LS) estimation of the weights in the bottom row represents a baseline performance wherein no noise model is used. In-sample training set predictions must be interpreted carefully, as all training set data is being used to optimize the weights. We see that the Kalman-based methods perform better out of training set (recall the model-mismatch issue[1]). Further, only the Kalman methods allow for on-line estimation (on the test set, the state-estimation Kalman filters continue to operate with the weight estimates fixed). The forward-backward method further improves performance over KF methods. Meanwhile, the clearning-equivalent cost function sacrifices both state and weight estimation MSE for improved in-sample prediction; the resulting test set performance is significantly worse. \n\nTable 1: Comparison of methods for two linear models \n\nModel Known \n         Train 1        Test 1             Train 2        Test 2 \n         Est.   Pred.   Est.   Pred.  w    Est.   Pred.   Est.   Pred.  w \nMLE      .094   .322    -      1.09   -    .165   .558    -      1.32   - \nCLRN     .203   .134    -      1.08   -    .343   .342    -      1.32   - \nKF       .134   .559    .132   0.59   -    .197   .778    .221   0.85   - \nFB       .094   .559    .132   0.59   -    .165   .778    .221   0.85   - \n\nModel Unknown \n         Train 1        Test 1             Train 2        Test 2 \n         Est.   Pred.   Est.   Pred.  w     Est.   Pred.   Est.   Pred.  w \nEIV      -      -       -      -      -     .172   .545    -      1.81   .122 \nCLRN     -      -       -      -      -     .049   .278    -      14.1   11.28 \nLS-KF    .138   .563    .139   .605   .134  .197   .778    .226   0.85   .325 \nLS-FB    .099   .347    .136   .603   .281  .169   .612    .229   0.89   .369 \nDKF      .135   .557    .133   .595   .212  .198   .779    .221   .863   .149 \nDFB      .096   .329    .134   .596   .187  .165   .587    .221   .859   .065 \nLS       -      .886    -      1.09   .612  -      1.08    -      1.32   0.590 \n\nMSE values for estimation (Est.), prediction (Pred.), and weights (w), normalized to signal variance. 1 - AR(11) model, σ_v^2 = 4, σ_r^2 = 1; 2000 training samples, 1000 testing samples; EIV and CLRN were not computed for the unknown model due to memory constraints. 2 - AR(5) model, σ_v^2 = .7, σ_r^2 = .5; 375 training samples, 125 testing samples. \n\nSeveral time series were used to compare the nonlinear methods, with the results summarized in Table 2. Conclusions parallel those for the linear case. Note that the DEKF method performed better than the baseline provided by standard backpropagation (wherein no model of the noise is used). The DEKF method exhibited fast convergence, requiring only 10-20 epochs of training. A DEFB method is under development. \n\nTable 2: Comparison of methods on nonlinear time series \n\n         NNet 1             NNet 2             NNet 3 \n         Train     Test     Train     Test     Train     Test \n         Es.  Pr.  Es.  Pr. Es.  Pr.  Es.  Pr. Es.  Pr.  Es.  Pr. \nBP-EKF   .17  .58  .15  .63 .08  .31  .08  .33 .16  .59  .17  .59 \nDEKF     .14  .57  .13  .59 .07  .30  .06  .32 .14  .56  .14  .55 \nBP       .95  .57  .95  .69 .22  .30  .29  .36 .92  .68  .92  .68 \n\nThe series NNet 1, 2, 3 are generated by autoregressive neural networks which exhibit limit cycle and chaotic behavior. σ_v^2 = .16, σ_r^2 = .81; 2700 training samples, 1300 testing samples. All network models were fit using 10 inputs and 5 hidden units. Cross-validation was not used in any of the methods. \n\nThe DEKF was tested on a speech signal corrupted with simulated bursting white noise (Figure 1). The method was applied to successive 64 ms (512 point) windows of the signal, with a new window starting every 8 ms (64 points). The results in the figure were computed assuming both σ_v^2 and σ_r^2 were known. The average SNR is improved by 9.94 dB. We also ran the experiment with σ_v^2 and σ_r^2 estimated using only the noisy signal (Nelson and Wan, 1997), and achieved an SNR improvement of 8.50 dB. In comparison, available \"state-of-the-art\" techniques of spectral subtraction (Boll, 1979) and RASTA processing (Hermansky et al., 1995) achieve SNR improvements of only .65 and 1.26 dB, respectively. We extend the algorithms to the colored noise case in a second paper (Nelson and Wan, 1997). \n\nFigure 1: Cleaning noisy speech with the DEKF; 33,000 points (5 sec.) shown. (Panels: Clean Speech, Noise, Noisy Speech, Cleaned Speech.) \n\n4 CONCLUSIONS \n\nWe have described various methods under a Kalman framework for the dual estimation of both states and weights of a noisy time series. These methods utilize both process and observation noise models to improve estimation performance. Work in progress includes extensions for colored noise, blind signal separation, forward-backward filtering, and noise estimation. While further study is needed, the dual extended Kalman filter methods for neural network prediction, estimation, and smoothing offer potentially powerful new tools for signal processing applications. \n\nAcknowledgements \n\nThis work was sponsored in part by NSF under grant ECS-9410823 and by ARPA/AASERT Grant DAAH04-95-1-0485. \n\nReferences \n\nS.F. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. ASSP-27, pp. 113-120. April 1979. \n\nJ. Connor, R. Martin, L. Atlas. Recurrent neural networks and robust time series prediction. IEEE Trans. on Neural Networks. March 1994. \n\nF. Lewis. Optimal Estimation. John Wiley & Sons, Inc., New York. 1986. \n\nG. Goodwin, K.S. Sin. Adaptive Filtering Prediction and Control.
Prentice-Hall, Inc., Englewood Cliffs, NJ. 1994. \n\nH. Hermansky, E. Wan, C. Avendano. Speech enhancement based on temporal processing. ICASSP Proceedings. 1995. \n\nA. Nelson, E. Wan. Neural speech enhancement using dual extended Kalman filtering. Submitted to ICNN'97. \n\nL. Nelson, E. Stear. The simultaneous on-line estimation of parameters and states in linear systems. IEEE Trans. on Automatic Control. February 1976. \n\nG. Puskorius, L. Feldkamp. Neural control of nonlinear dynamic systems with Kalman filter trained recurrent networks. IEEE Trans. on Neural Networks, vol. 5, no. 2. 1994. \n\nG. Seber, C. Wild. Nonlinear Regression. John Wiley & Sons. 1989. \n\nA. Weigend, H.G. Zimmermann. Clearning. University of Colorado Computer Science Technical Report CU-CS-772-95. May 1995. \n"}, "award": [], "sourceid": 1202, "authors": [{"given_name": "Eric", "family_name": "Wan", "institution": null}, {"given_name": "Alex", "family_name": "Nelson", "institution": null}]}