{"title": "Backprop KF: Learning Discriminative Deterministic State Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 4376, "page_last": 4384, "abstract": "Generative state estimators based on probabilistic filters and smoothers are one of the most popular classes of state estimators for robots and autonomous vehicles. However, generative models have limited capacity to handle rich sensory observations, such as camera images, since they must model the entire distribution over sensor readings. Discriminative models do not suffer from this limitation, but are typically more complex to train as latent variable models for state estimation. We present an alternative approach where the parameters of the latent state distribution are directly optimized as a deterministic computation graph, resulting in a simple and effective gradient descent algorithm for training discriminative state estimators. We show that this procedure can be used to train state estimators that use complex input, such as raw camera images, which must be processed using expressive nonlinear function approximators such as convolutional neural networks. Our model can be viewed as a type of recurrent neural network, and the connection to probabilistic filtering allows us to design a network architecture that is particularly well suited for state estimation. We evaluate our approach on a synthetic tracking task with raw image inputs and on the visual odometry task in the KITTI dataset. 
The results show significant improvement over both standard generative approaches and regular recurrent neural networks.", "full_text": "Backprop KF: Learning Discriminative Deterministic\n\nState Estimators\n\nTuomas Haarnoja, Anurag Ajay, Sergey Levine, Pieter Abbeel\n\n{haarnoja, anuragajay, svlevine, pabbeel}@berkeley.edu\n\nDepartment of Computer Science, University of California, Berkeley\n\nAbstract\n\nGenerative state estimators based on probabilistic \ufb01lters and smoothers are one\nof the most popular classes of state estimators for robots and autonomous vehi-\ncles. However, generative models have limited capacity to handle rich sensory\nobservations, such as camera images, since they must model the entire distribution\nover sensor readings. Discriminative models do not suffer from this limitation,\nbut are typically more complex to train as latent variable models for state estima-\ntion. We present an alternative approach where the parameters of the latent state\ndistribution are directly optimized as a deterministic computation graph, resulting\nin a simple and effective gradient descent algorithm for training discriminative\nstate estimators. We show that this procedure can be used to train state estimators\nthat use complex input, such as raw camera images, which must be processed\nusing expressive nonlinear function approximators such as convolutional neural\nnetworks. Our model can be viewed as a type of recurrent neural network, and\nthe connection to probabilistic \ufb01ltering allows us to design a network architecture\nthat is particularly well suited for state estimation. We evaluate our approach on\na synthetic tracking task with raw image inputs and on the visual odometry task in\nthe KITTI dataset. 
The results show signi\ufb01cant improvement over both standard\ngenerative approaches and regular recurrent neural networks.\n\n1\n\nIntroduction\n\nState estimation is an important component of mobile robotic applications, including autonomous\ndriving and \ufb02ight [22]. Generative state estimators based on probabilistic \ufb01lters and smoothers are\none of the most popular classes of state estimators. However, generative models are limited in their\nability to handle rich observations, such as camera images, since they must model the full distribution\nover sensor readings. This makes it dif\ufb01cult to directly incorporate images, depth maps, and other\nhigh-dimensional observations. Instead, the most popular methods for vision-based state estimation\n(such as SLAM [22]) are based on domain knowledge and geometric principles. Discriminative\nmodels do not need to model the distribution over sensor readings, but are more complex to train\nfor state estimation. Discriminative models such as CRFs [16] typically do not use latent variables,\nwhich means that training data must contain full state observations. Most real-world state estimation\nproblem settings only provide partial labels. For example, we might observe noisy position readings\nfrom a GPS sensor and need to infer the corresponding velocities. While discriminative models can\nbe augmented with latent state [18], this typically makes them harder to train.\nWe propose an ef\ufb01cient and scalable method for discriminative training of state estimators. Instead of\nperforming inference in a probabilistic latent variable model, we instead construct a deterministic\ncomputation graph with equivalent representational power. This computation graph can then be\noptimized end-to-end with simple backpropagation and gradient descent methods. 
This corresponds\nto a type of recurrent neural network model, where the architecture of the network is informed by the\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fstructure of the probabilistic state estimator. Aside from the simplicity of the training procedure, one\nof the key advantages of this approach is the ability to incorporate arbitrary nonlinear components\ninto the observation and transition functions. For example, we can condition the transitions on raw\ncamera images processed by multiple convolutional layers, which have been shown to be remarkably\neffective for interpreting camera images. The entire network, including the observation and transition\nfunctions, is trained end-to-end to optimize its performance on the state estimation task.\nThe main contribution of this work is to draw a connection between discriminative probabilistic state\nestimators and recurrent computation graphs, and thereby derive a new discriminative, deterministic\nstate estimation method. From the point of view of probabilistic models, we propose a method for\ntraining expressive discriminative state estimators by reframing them as representationally equivalent\ndeterministic models. From the point of view of recurrent neural networks, we propose an approach\nfor designing neural network architectures that are well suited for state estimation, informed by\nsuccessful probabilistic state estimation models. We evaluate our approach on a visual tracking\nproblem, which requires processing raw images and handling severe occlusion, and on estimating\nvehicle pose from images in the KITTI dataset [8]. The results show signi\ufb01cant improvement over\nboth standard generative methods and standard recurrent neural networks.\n\n2 Related Work\n\nFigure 1: A generative state space model\nwith hidden states xt and observations ot\ngenerated by the model. 
ot are observed\nat both training and test time.\n\nSome of the most successful methods for state estima-\ntion have been probabilistic generative state space mod-\nels (SSMs) based on \ufb01ltering and smoothing (Figure 1).\nKalman \ufb01lters are perhaps the best known state estimators,\nand can be extended to the case of nonlinear dynamics\nthrough linearization and the unscented transform. Non-\nparametric \ufb01ltering methods, such as particle \ufb01ltering, are\nalso often used for tasks with multimodal posteriors. For\na more complete review of state estimation, we refer the\nreader to standard references on this topic [22].\nGenerative models aim to estimate the distribution over state observation sequences o1:T as originating\nfrom some underlying hidden state x1:T , which is typically taken to be the state space of the system.\nThis becomes impractical when the observation space is extremely high dimensional, and when the\nobservation is a complex, highly nonlinear function of the state, as in the case of vision-based state\nestimation, where ot corresponds to an image viewed from a robot\u2019s on-board camera. The challenges\nof generative state space estimation can be mitigated by using complex observation models [14] or\napproximate inference [15], but building effective generative models of images remains a challenging\nopen problem.\nAs an alternative to generative models, discriminative models such as conditional random \ufb01elds\n(CRFs) can directly estimate p(xt|o1:t) [16]. A number of CRFs and conditional state space models\n(CSSMs) have been applied to state estimation [21, 20, 12, 17, 9], typically using a log-linear\nrepresentation. More recently, discriminative \ufb01ne-tuning of generative models with nonlinear neural\nnetwork observations [6], as well as direct training of CRFs with neural network factors [7], have\nallowed for training of nonlinear discriminative models. 
However, such models have not been\nextensively applied to state estimation. Training CRFs and CSSMs typically requires access to\ntrue state labels, while generative models only require observations, which often makes them more\nconvenient for physical systems where the true underlying state is unknown. Although CRFs have\nalso been combined with latent states [18], the dif\ufb01culty of CRF inference makes latent state CRF\nmodels dif\ufb01cult to train. Prior work has also proposed to optimize SSM parameters with respect to a\ndiscriminative loss [1]. In contrast to this work, our approach incorporates rich sensory observations,\nincluding images, and allows for training of highly expressive discriminative models.\nOur method optimizes the state estimator as a deterministic computation graph, analogous to recurrent\nneural network (RNN) training. The use of recurrent neural networks (RNNs) for state estimation\nhas been explored in several prior works [24, 4, 23, 19], but has generally been limited to simple\ntasks without complex sensory inputs such as images. Part of the reason for this is the dif\ufb01culty of\ntraining general-purpose RNNs. 
Recently, innovative RNN architectures have been successful at\nmitigating this problem, through models such as the long short-term memory (LSTM) [10] and the\n\n2\n\n\fFigure 2: (a) Standard two-step engineering approach for \ufb01ltering with high-dimensional observations.\nThe generative part has hidden state xt and two observations, yt and zt, where the latter observation\nis actually the output of a second deterministic model zt = g\u03b8(ot), denoted by dashed lines and\ntrained explicitly to predict zt. (b) Computation graph that jointly optimizes both models in (a),\nconsisting of the deterministic map g\u03b8 and a deterministic \ufb01lter that infers the hidden state given zt.\nBy viewing the entire model as a single deterministic computation graph, it can be trained end-to-end\nusing backpropagation as explained in Section 4.\n\ngated recurrent unit (GRU) [5]. LSTMs have been combined with vision for perception tasks such as\nactivity recognition [3]. However, in the domain of state estimation, such black-box models ignore\nthe considerable domain knowledge that is available. By drawing a connection between \ufb01ltering and\nrecurrent networks, we can design recurrent computation graphs that are particularly well suited to\nstate estimation and, as shown in our evaluation, can achieve improved performance over standard\nLSTM models.\n\n3 Preliminaries\n\nPerforming state estimation with a generative model directly using high-dimensional observations\not, such as camera images, is very dif\ufb01cult, because these observations are typically produced by a\ncomplex and highly nonlinear process. 
However, in practice, a low-dimensional vector, zt, which can\nbe extracted from ot, can fully capture the dependence of the observation on the underlying state\nof the system. Let xt denote this state, and let yt denote some labeling of the states that we wish\nto be able to infer from ot. For example, ot might correspond to pairs of images from a camera\non an automobile, zt to its velocity, and yt to the location of the vehicle. In that case, we can \ufb01rst\ntrain a discriminative model g\u03b8(ot) to predict zt from ot in a feedforward manner, and then \ufb01lter the\npredictions to output the desired state labels y1:t. For example, a Kalman \ufb01lter with hidden state\nxt could be trained to use the predicted zt as observations, and then perform inference over xt and\nyt at test time. This standard approach for state estimation with high-dimensional observations is\nillustrated in Figure 2a.\nWhile this method may be viewed as an engineering solution without a probabilistic interpretation, it\nhas the advantage that g\u03b8(ot) is trained discriminatively, and the entire model is conditioned on ot,\nwith xt acting as an internal latent variable. This is why the model does not need to represent the\ndistribution over observations explicitly. However, the function g\u03b8(ot) that maps the raw observations\not to low-dimensional predictions zt is not trained for optimal state estimation. Instead, it is trained\nto predict an intermediate variable zt that can be readily integrated into the generative \ufb01lter.\n\n4 Discriminative Deterministic State Estimation\n\nOur contribution is based on a generalized view of state estimation that subsumes the na\u00efve, piecewise-\ntrained models discussed in the previous section and allows them to be trained end-to-end using\nsimple and scalable stochastic gradient descent methods. 
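The two-stage pipeline of Figure 2a can be sketched in a few lines. This is a minimal scalar illustration, not the authors' implementation: the stand-in g_theta, the fake "pixel" observations, and the filter constants (a, q, r) are all assumptions made for the sake of the example.

```python
import random

# Stand-in for the pretrained observation network g_theta. In the paper this
# is a convolutional network; here it just "extracts" a noisy scalar position
# z_t from a fake observation o_t (a list of toy pixel values).
def g_theta(o_t):
    return sum(o_t) / len(o_t)

def kalman_step(mu, sigma2, z, a=1.0, q=0.01, r=0.1):
    """One scalar Kalman filter step: predict with x_t = a*x_{t-1} + noise,
    then correct with the observation z_t = x_t + noise."""
    mu_pred = a * mu
    s2_pred = a * sigma2 * a + q
    k = s2_pred / (s2_pred + r)           # Kalman gain
    mu_new = mu_pred + k * (z - mu_pred)  # posterior mean
    s2_new = (1.0 - k) * s2_pred          # posterior variance
    return mu_new, s2_new

random.seed(0)
true_x, mu, sigma2 = 1.0, 0.0, 1.0
for _ in range(50):
    o_t = [true_x + random.gauss(0.0, 0.3) for _ in range(4)]  # fake "pixels"
    z_t = g_theta(o_t)                         # stage 1: feedforward prediction
    mu, sigma2 = kalman_step(mu, sigma2, z_t)  # stage 2: generative filter
```

The two stages are trained separately in this piecewise scheme, which is exactly the limitation the end-to-end formulation addresses.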
In the na\u00efve approach, the observation\nfunction g\u03b8(ot) is trained to directly predict zt, since a standard generative \ufb01lter model does not\nprovide for a straightforward way to optimize g\u03b8(ot) with respect to the accuracy of the \ufb01lter on\nthe labels y1:T . However, the \ufb01lter can be viewed as a computation graph unrolled through time, as\nshown in Figure 2b. In this graph, the \ufb01lter has an internal state de\ufb01ned by the posterior over xt. For\nexample, in a Kalman \ufb01lter with Gaussian posteriors, we can represent the internal state with the\ntuple st = (\u00b5xt, \u03a3xt). In general, we will use st to refer to the state of any \ufb01lter. We also augment\nthis graph with an output function q(st) = \u03c6yt that outputs the parameters of a distribution over\nlabels yt. In the case of a Kalman \ufb01lter, we would simply have q(st) = (Cy\u00b5xt, Cy\u03a3xtCTy), where\nthe matrix Cy de\ufb01nes a linear observation function from xt to yt.\nViewing the \ufb01lter as a computation graph in this way, g\u03b8(ot) can be trained discriminatively\non the entire sequence, rather than individually on single time steps. Let l(\u03c6yt) be a loss func-\ntion on the output distribution of the computation graph, which might, for example, be given by\nl(\u03c6yt) = \u2212 log p\u03c6yt(yt), where p\u03c6yt is the distribution induced by the parameters \u03c6yt, and yt is\nthe label. Let L(\u03b8) = \u2211t l(\u03c6yt) be the loss on an entire sequence with respect to \u03b8. Furthermore,\nlet \u03ba(st, zt+1) denote the operation performed by the \ufb01lter to compute st+1 based on st and zt+1.\nWe can compute the gradient of L(\u03b8) with respect to the parameters \u03b8 by \ufb01rst recursively computing\nthe gradient of the loss with respect to the \ufb01lter state st from the back to the front according to the\nfollowing recursion:\n\ndL/dst\u22121 = (d\u03c6yt\u22121/dst\u22121)(dL/d\u03c6yt\u22121) + (dst/dst\u22121)(dL/dst),\n\n(1)\n\nand then applying the chain rule to obtain\n\n\u2207\u03b8L(\u03b8) = \u2211T\nt=1 (dzt/d\u03b8)(dst/dzt)(dL/dst).\n\n(2)\n\nAll of the derivatives in these equations can be obtained from g\u03b8(ot), \u03ba(st\u22121, zt), q(st), and l(\u03c6yt):\n\ndst/dst\u22121 = \u2207st\u22121\u03ba(st\u22121, zt),\ndst/dzt = \u2207zt\u03ba(st\u22121, zt),\nd\u03c6yt/dst = \u2207stq(st),\ndL/d\u03c6yt = \u2207\u03c6yt l(\u03c6yt),\ndzt/d\u03b8 = \u2207\u03b8g\u03b8(ot).\n\n(3)\n\nThe parameters \u03b8 can be optimized with gradient descent using these gradients. This is an instance of\nbackpropagation through time (BPTT), a well known algorithm for training recurrent neural networks.\nRecognizing this connection between state-space models and recurrent neural networks allows us to\nextend this generic \ufb01ltering architecture and explore the continuum of models between \ufb01lters with\na discriminatively trained observation model g\u03b8(ot) all the way to fully general recurrent neural\nnetworks. In our experimental evaluation, we use a standard Kalman \ufb01lter update as \u03ba(st, zt+1), but\nwe use a nonlinear convolutional neural network observation function g\u03b8(ot). We found that this\nprovides a good trade-off between incorporating domain knowledge and end-to-end learning for the\ntask of visual tracking and odometry, but other variants of this model could be explored in future\nwork.\n\n5 Experimental Evaluation\n\nIn this section, we compare our deterministic discriminatively trained state estimator with a set\nof alternative methods, including simple feedforward convolutional networks, a piecewise-trained\nKalman \ufb01lter, and fully general LSTM models. 
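Before describing the models, the gradient recursion of Section 4 can be made concrete with a scalar sketch. The fixed-gain filter kappa, the data, and all constants below are illustrative assumptions, not the paper's model; the sketch verifies the backward recursion of equations (1)-(3) against a finite-difference gradient.

```python
# Scalar sketch of the gradient recursion: a fixed-gain filter
# kappa(s, z) = (1-k)*a*s + k*z, observation model z_t = g_theta(o_t) = theta*o_t,
# output q(s_t) = s_t, and squared-error loss l = (s_t - y_t)^2.
a, k, theta = 0.9, 0.3, 0.5
obs = [1.0, -0.5, 2.0, 0.7]       # o_1..o_T (made up)
labels = [0.2, 0.1, 0.6, 0.5]     # y_1..y_T (made up)

def forward(theta):
    s, states, loss = 0.0, [], 0.0
    for o, y in zip(obs, labels):
        z = theta * o                  # g_theta(o_t)
        s = (1 - k) * a * s + k * z    # kappa(s_{t-1}, z_t)
        states.append(s)
        loss += (s - y) ** 2           # l(phi_{y_t}) with q(s_t) = s_t
    return loss, states

loss, states = forward(theta)

# Backward recursion: accumulate dL/ds_t from the back (eq. 1), then apply
# the chain rule through z_t = g_theta(o_t) to get the parameter gradient (eq. 2).
grad_theta, dL_ds = 0.0, 0.0
for t in reversed(range(len(obs))):
    dL_dphi = 2.0 * (states[t] - labels[t])  # dL/dphi_t, with dphi/ds = 1
    dL_ds = dL_dphi + (1 - k) * a * dL_ds    # eq. (1)
    grad_theta += obs[t] * k * dL_ds         # eq. (2): dz/dtheta * ds/dz * dL/ds

# Numerical check by central finite differences.
eps = 1e-6
numeric = (forward(theta + eps)[0] - forward(theta - eps)[0]) / (2 * eps)
```

The recursion and the finite-difference estimate agree, which is the essence of BPTT through a filter.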
We evaluate these models on two tasks that require\nprocessing of raw image input: a synthetic task of tracking a red disk in the presence of clutter and\nsevere occlusion; and the KITTI visual odometry task [8].\n\n5.1 State Estimation Models\n\nOur proposed model, which we call the \u201cbackprop Kalman \ufb01lter\u201d (BKF), is a computation graph\nmade up of a Kalman \ufb01lter (KF) and a feedforward convolutional neural network that distills the\nobservation ot into a low-dimensional signal zt, which serves as the observation for the KF. The\nneural network outputs both a point observation zt and an observation covariance matrix Rt. Since\nthe network is trained together with the \ufb01lter, it can learn to use the covariance matrix to communicate\nthe desired degree of uncertainty about the observation, so as to maximize the accuracy of the \ufb01nal\n\ufb01lter prediction.\n\n4\n\n\f[Figure 3 block diagram. Feedforward network: ot \u2192 conv \u2192 resp_norm \u2192 ReLU \u2192 max_pool (twice) \u2192 fc \u2192 ReLU (twice) \u2192 fc, producing zt and \u02c6Lt, with Rt = LtLTt where Lt is obtained from \u02c6Lt by a reshape and an exponential on the diagonal. Kalman \ufb01lter: \u00b5\u2032xt = A\u00b5xt\u22121; \u03a3\u2032xt = A\u03a3xt\u22121AT + BwQBTw; Kt = \u03a3\u2032xtCTz(Cz\u03a3\u2032xtCTz + Rt)\u22121; \u00b5xt = \u00b5\u2032xt + Kt(zt \u2212 Cz\u00b5\u2032xt); \u03a3xt = (I \u2212 KtCz)\u03a3\u2032xt. Loss: (1/2TN) \u2211i \u2211t ||Cy\u00b5(i)xt \u2212 y(i)t||\u00b2.]\n\nFigure 3: Illustration of the computation graph for the BKF. 
The graph is composed of a feedforward\npart, which processes the raw images ot and outputs intermediate observations zt and a matrix \u02c6Lt\nthat is used to form a positive de\ufb01nite observation covariance matrix Rt, and a recurrent part that\nintegrates zt through time to produce \ufb01ltered state estimates. See Appendix A for details.\n\nWe compare the backprop KF to three alternative state estimators: the \u201cfeedforward model\u201d, the\n\u201cpiecewise KF\u201d, and the \u201cLSTM model\u201d. The simplest of the models, the feedforward model, does\nnot consider the temporal structure in the task at all, and consists only of a feedforward convolutional\nnetwork that takes in the observations ot and outputs a point estimate \u02c6yt of the label yt. This approach\nis viable only if the label information can be directly inferred from ot, such as when tracking an\nobject. On the other hand, tasks that require long term memory, such as visual odometry, cannot\nbe solved with a plain feedforward network. The piecewise KF model corresponds to the simple\ngenerative approach described in Section 3, which combines the feedforward network with a Kalman\n\ufb01lter that \ufb01lters the network predictions zt to produce a distribution over the state estimate \u02c6xt. The\npiecewise model is based on the same computation graph as the BKF, but does not optimize the \ufb01lter\nand network together end-to-end, instead training the two pieces separately. The only difference\nbetween the two graphs is that the piecewise KF does not implement the additional pathway for\npropagating the uncertainty from the feedforward network into the \ufb01lter, but instead, the \ufb01lter needs\nto learn to handle the uncertainty in zt independently. An example instantiation of BKF is depicted\nin Figure 3. 
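A single step of the recurrent part of Figure 3 can be sketched as follows. This is a scalar sketch, not the paper's implementation: the stand-in network, its occlusion heuristic, and the dynamics constants are all hypothetical. The point it illustrates is that the Kalman gain, and hence the weight given to zt, depends on the covariance Rt emitted by the network, which the BKF forms as a positive quantity via an exponential.

```python
import math

# Hypothetical stand-in for the conv net: it emits an observation z_t and a
# log-std from which the observation covariance R_t = exp(2*log_std) is built,
# mirroring how the BKF forms a positive definite R_t from the network output.
def fake_network(o_t):
    z_t = sum(o_t) / len(o_t)
    log_std = -1.0 if max(o_t) - min(o_t) < 1.0 else 1.0  # confident vs "occluded"
    return z_t, math.exp(2.0 * log_std)

def bkf_step(mu, sigma2, o_t, a=1.0, q=0.05):
    z, r = fake_network(o_t)
    mu_pred, s2_pred = a * mu, a * a * sigma2 + q
    k_gain = s2_pred / (s2_pred + r)   # gain now depends on the learned R_t
    return mu_pred + k_gain * (z - mu_pred), (1 - k_gain) * s2_pred

mu1, s1 = bkf_step(0.0, 1.0, [0.9, 1.0, 1.1])    # low-spread input -> trusted
mu2, s2 = bkf_step(mu1, s1, [-3.0, 5.0, 4.0])    # high-spread input -> downweighted
```

With a small Rt the estimate jumps to the observation; with a large Rt the same update barely moves, which is exactly the uncertainty-communication mechanism described above.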
A detailed overview of the computational blocks shown in the \ufb01gure is deferred to\nAppendix A.\nFinally, we compare to a recurrent neural network based on LSTM hidden units [10]. This model\nresembles the backprop KF, except that the \ufb01lter portion of the graph is replaced with a generic\nLSTM layer. The LSTM model learns the dynamics from data, without incorporating the domain\nknowledge present in the KF.\n\n5.2 Neural Network Design\n\nA special aspect of our network design is a novel response normalization layer that is applied to the\nconvolutional activations before applying the nonlinearity. The response normalization transforms\nthe activations such that the activations of layer i always have mean \u00b5i and variance \u03c3\u00b2i regardless\nof the input to the layer. The parameters \u00b5i and \u03c3\u00b2i are learned along with the other parameters. This\nnormalization is used in all of the convolutional networks in our evaluation, and resembles batch\nnormalization [11] in its behavior. However, we found this approach to be substantially more effective\nfor recurrent models that require backpropagation through time, compared to the more standard\nbatch normalization approach, which is known to require additional care when applied to recurrent\nnetworks. It has since been proposed independently of our work in [2], which gives an in-depth\nanalysis of the method. The normalization is followed by a recti\ufb01ed linear unit (ReLU) and a max\npooling layer.\n\n5.3 Synthetic Visual State Estimation Task\n\nOur state estimation task is meant to re\ufb02ect some of the typical challenges in visual state estimation:\nthe need for long-term tracking to handle occlusions, the presence of noise, and the need to process\nraw pixel data. The task requires tracking a red disk from image observations, as shown in Figure\n4. 
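The response normalization of Section 5.2 can be sketched as follows. The fixed values for the per-layer mean and standard deviation stand in for learned parameters, and normalizing over a flat list of activations is a simplifying assumption; the paper does not spell out the normalization axis in the text.

```python
# Sketch of response normalization: standardize the activations, then rescale
# to a target mean mu_i and variance sigma_i^2 (learned in the paper, fixed here).
def response_norm(activations, mu_i=0.5, sigma_i=2.0, eps=1e-5):
    n = len(activations)
    mean = sum(activations) / n
    var = sum((a - mean) ** 2 for a in activations) / n
    return [mu_i + sigma_i * (a - mean) / (var + eps) ** 0.5 for a in activations]

out = response_norm([10.0, 20.0, 30.0, 40.0])
m = sum(out) / len(out)                       # -> mu_i, independent of input scale
v = sum((a - m) ** 2 for a in out) / len(out) # -> sigma_i^2, up to eps
```

Whatever the input statistics, the output statistics match the learned targets, which is the invariance the paper relies on for stable backpropagation through time.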
Distractor disks with random colors and radii are added into the scene to occlude the red disk,\nand the trajectories of all disks follow linear-Gaussian dynamics, with a linear spring force that pulls\nthe disks toward the center of the frame and a drag force that prevents high velocities. The disks\ncan temporarily leave the frame since contacts are not modeled. Gaussian noise is added to perturb\nthe motion. While these model parameters are assumed to be known in the design of the \ufb01lter, it is\nalso straightforward to learn the model parameters. The dif\ufb01culty of the task can be adjusted by\nincreasing or decreasing the number of distractor disks, which affects the frequency of occlusions.\n\n5\n\n\fFigure 4: Illustration of six consecutive frames of two training sequences. The objective is to track\nthe red disk (circled in the \ufb01rst frame for illustrative purposes) throughout the 100-frame sequence.\nThe distractor disks are sampled for each sequence at random and overlaid on top of the target disk.\nThe upper row illustrates an easy sequence (9 distractors), while the bottom row is a sequence of high\ndif\ufb01culty (99 distractors). Note that the target is very rarely visible in the hardest sequences.\n\nTable 1: Benchmark Results\n\nState Estimation Model     # Parameters   RMS test error \u00b1\u03c3\nfeedforward model          7394           0.2322 \u00b1 0.1316\npiecewise KF               7397           0.1160 \u00b1 0.0330\nLSTM model (64 units)      33506          0.1407 \u00b1 0.1154\nLSTM model (128 units)     92450          0.1423 \u00b1 0.1352\nBKF (ours)                 7493           0.0537 \u00b1 0.1235\n\nThe easiest variants of the task are solvable with a feedforward estimator, while the hardest variants\nrequire long-term tracking through occlusion. 
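The disk dynamics described above can be sketched as a damped, noisy linear system. The spring, drag, and noise constants below are illustrative assumptions, not the values used in the paper.

```python
import random

# One disk under linear-Gaussian dynamics: a spring force toward the frame
# center (at 0) and a drag term that damps the velocity, plus Gaussian noise.
def step(pos, vel, spring=0.05, drag=0.1, noise=0.02, dt=1.0):
    acc = -spring * pos - drag * vel       # pull to center, prevent high speeds
    vel = vel + dt * acc + random.gauss(0.0, noise)
    pos = pos + dt * vel                   # no contacts: disks may leave the frame
    return pos, vel

random.seed(1)
pos, vel = 1.5, 0.0                        # start off-center
for _ in range(200):
    pos, vel = step(pos, vel)
```

The spring and drag keep the trajectory bounded near the frame center, so occlusions rather than unbounded motion are what make the task hard.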
To emphasize the sample ef\ufb01ciency of the models, we\ntrained them using 100 randomly sampled sequences.\nThe results in Table 1 show that the BKF outperforms both the standard probabilistic KF-based\nestimators and the more powerful and expressive LSTM estimators. The tracking error of the simple\nfeedforward model is signi\ufb01cantly larger due to the occlusions, and the model tends to predict the\nmean coordinates when the target is occluded. The piecewise model performs better, but because\nthe observation covariance is not conditioned on ot, the Kalman \ufb01lter learns to use a very large\nobservation covariance, which forces it to rely almost entirely on the dynamics model for predictions.\nOn the other hand, since the BKF learns to output the observation covariances conditioned on ot that\noptimize the performance of the \ufb01lter, it is able to \ufb01nd a compromise between the observations and\nthe dynamics model. Finally, although the LSTM model is the most general, it performs worse than\nthe BKF, since it does not incorporate prior knowledge about the structure of the state estimation\nproblem.\nTo test the robustness of the estimator to occlusions,\nwe trained each model on a training set of 1000 se-\nquences of varying amounts of clutter and occlusions.\nWe then evaluated the models on several test sets,\neach corresponding to a different level of occlusion\nand clutter. The tracking error as the test set dif\ufb01culty\nis varied is shown in Figure 5. Note that even in the\nabsence of distractors, BKF and LSTM models out-\nperform the feedforward model, since the target oc-\ncasionally leaves the \ufb01eld of view. The performance\nof the piecewise KF does not change signi\ufb01cantly as\nthe dif\ufb01culty increases: due to the high amount of\nclutter during training, the piecewise KF learns to\nuse a large observation covariance and rely primarily\non feedforward estimates for prediction. The BKF\nachieves the lowest error in nearly all cases. 
At the\nsame time, the BKF also has dramatically fewer pa-\nrameters than the LSTM models, since the transitions\ncorrespond to simple Kalman \ufb01lter updates.\n\nFigure 5: The RMS error of various models\ntrained on a single training set that contained\nsequences of varying dif\ufb01culty. The models\nwere then evaluated on several test sets of\n\ufb01xed dif\ufb01culty.\n\n6\n\n\fFigure 6: Example image sequence from the KITTI dataset (top row) and the corresponding difference\nimage that is obtained by subtracting the RGB values of the previous image from the current image\n(bottom row). The observation ot is formed by concatenating the two images into a six-channel\nfeature map which is then treated as an input to a convolutional neural network. The \ufb01gure shows\nevery \ufb01fth sample from the original sequence for illustrative purposes.\n\nTable 2: KITTI Visual Odometry Results\n\n                                      Test 100                 Test 100/200/400/800\n# training trajectories          3       6       10          3       6       10\nTranslational Error [m/m]\n  piecewise KF               0.3257  0.2452  0.2265      0.3277  0.2313  0.2197\n  LSTM model (128 units)     0.5022  0.3456  0.2769      0.5491  0.4732  0.4352\n  LSTM model (256 units)     0.5199  0.3172  0.2630      0.5439  0.4506  0.4228\n  BKF (ours)                 0.3089  0.2346  0.2062      0.2982  0.2031  0.1804\nRotational Error [deg/m]\n  piecewise KF               0.1408  0.1028  0.0978      0.1069  0.0768  0.0754\n  LSTM model (128 units)     0.5484  0.3681  0.3767      0.4123  0.3573  0.3530\n  LSTM model (256 units)     0.4960  0.3391  0.2933      0.3845  0.3566  0.3221\n  BKF (ours)                 0.1207  0.0901  0.0801      0.0888  0.0587  0.0556\n\n5.4 KITTI Visual Odometry Experiment\n\nNext, we evaluated the state estimation models on the visual odometry task in the KITTI dataset [8]\n(Figure 6, top row). 
The publicly available training set contains 11 trajectories of ego-centric video\nsequences of a passenger car driving in suburban scenes, along with ground truth position and\norientation. The dataset is challenging since it is relatively small for learning-based algorithms, and\nthe trajectories are visually very diverse. For training the Kalman \ufb01lter variants, we used a simpli\ufb01ed\nstate-space model with three of the state variables corresponding to the vehicle\u2019s 2D pose (two spatial\ncoordinates and heading) and two for the forward and angular velocities. Because the dynamics\nmodel is non-linear, we equipped our model-based state estimators with extended Kalman \ufb01lters,\nwhich is a straightforward addition to the BKF framework.\nThe objective of the task is to estimate the relative change in the pose during \ufb01xed-length subse-\nquences. However, because inferring the pose requires integration over all past observations, a simple\nfeedforward model cannot be used directly. Instead, we trained a feedforward network, consisting of\nfour convolutional and two fully connected layers and having approximately half a million parameters,\nto estimate the velocities from pairs of images at consecutive time steps. In practice, we found it better\nto use a difference image, corresponding to the change in the pixel intensities between the images,\nalong with the current image as an input to the feedforward network (Figure 6). The ground truth\nvelocities, which were used to train the piecewise KF as well as to pretrain the other models, were\ncomputed by \ufb01nite differencing from the ground truth positions. The recurrent models\u2013piecewise\nKF, the BKF, and the LSTM model\u2013were then \ufb01ne-tuned to predict the vehicle\u2019s pose. 
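The simplified state-space model described above can be sketched as follows. The time step and the random-walk process model for the velocities are assumptions made for illustration; only the dynamics and the Jacobian used by the extended Kalman filter are shown.

```python
import math

# Simplified 2D vehicle model: state (x, y, heading, forward velocity v,
# angular velocity w); the network's observation is z_t = (v, w). The dynamics
# are nonlinear in the heading, so the EKF linearizes with the Jacobian below.
def f(state, dt=0.1):
    x, y, th, v, w = state
    return [x + dt * v * math.cos(th),
            y + dt * v * math.sin(th),
            th + dt * w,
            v,                       # velocities modeled as a random walk
            w]

def jacobian_f(state, dt=0.1):
    x, y, th, v, w = state
    return [[1, 0, -dt * v * math.sin(th), dt * math.cos(th), 0],
            [0, 1,  dt * v * math.cos(th), dt * math.sin(th), 0],
            [0, 0, 1, 0, dt],
            [0, 0, 0, 1, 0],
            [0, 0, 0, 0, 1]]

# Driving with heading pi/2 moves the vehicle along +y.
s = f([0.0, 0.0, math.pi / 2, 10.0, 0.0])
```

An EKF step then uses jacobian_f in place of the constant A of the linear filter, which is the "straightforward addition" to the BKF framework mentioned above.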
Additionally,\nfor the LSTM model, we found it crucial to pretrain the recurrent layer to predict the pose from the\nvelocities before \ufb01ne-tuning.\nWe evaluated each model using 11-fold cross-validation, and we report the average errors of the\nheld-out trajectories over the folds. We trained the models by randomly sampling subsequences of\n100 time steps. For each fold, we constructed two test sets using the held-out trajectory: the \ufb01rst set\ncontains all possible subsequences of 100 time steps, and the second all subsequences of lengths\n100, 200, 400, and 800.1 We repeated each experiment using 3, 6, or all 10 of the sequences in each\ntraining fold to evaluate the resilience of each method to over\ufb01tting.\n\n1The second test set aims to mimic the of\ufb01cial (publicly unavailable) test protocol. Note, however, that\nbecause the methods are not tested on the same sequences as the of\ufb01cial test set, they are not directly comparable\nto results on the of\ufb01cial KITTI benchmark.\n\n7\n\n\fTable 2 lists the cross-validation results. As expected, the error decreases consistently as the number\nof training sequences becomes larger. In each case, BKF outperforms the other variants in both\npredicting the position and heading of the vehicle. Because both the piecewise KF and the BKF\nincorporate domain knowledge, they are more data-ef\ufb01cient. Indeed, the performance of the LSTM\ndegrades faster as the number of training sequences is decreased. Although the models were trained\non subsequences of 100 time steps, they were also tested on a set containing a mixture of different\nsequence lengths. 
The LSTM model generally failed to generalize to longer sequences, while the Kalman filter variants performed slightly better on mixed sequence lengths.

6 Discussion

In this paper, we proposed a discriminative approach to state estimation that consists of reformulating probabilistic generative state estimation as a deterministic computation graph. This makes it possible to train our method end-to-end using simple backpropagation through time (BPTT) methods, analogously to a recurrent neural network. In our evaluation, we present an instance of this approach that we refer to as the backprop KF (BKF), which corresponds to an (extended) Kalman filter combined with a feedforward convolutional neural network that processes raw image observations. Our approach to state estimation has two key benefits. First, we avoid the need to construct generative state-space models over complex, high-dimensional observation spaces such as raw images. Second, by reformulating the probabilistic state estimator as a deterministic computation graph, we can apply simple and effective backpropagation and stochastic gradient descent optimization methods to learn the model parameters. This avoids the usual challenges associated with inference in continuous, nonlinear conditional probabilistic models, while still preserving the same representational power as the corresponding approximate probabilistic inference method, which in our experiments corresponds to approximate Gaussian posteriors in a Kalman filter.

Our approach can also be viewed as an application of ideas from probabilistic state-space models to the design of recurrent neural networks. Since we optimize the state estimator as a deterministic computation graph, it corresponds to a particular type of deterministic neural network model.
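To make the deterministic-graph view concrete, a single Kalman filter step can be written as a chain of ordinary matrix operations. The sketch below (a minimal illustration, not the paper's exact architecture) uses NumPy for the forward pass; in practice the same operations would be expressed in an autodiff framework, so a loss on the filtered state backpropagates through the update into the observation `y` produced by the network, exactly as BPTT through an RNN cell.

```python
import numpy as np

def kf_step(mu, Sigma, y, A, Q, C, R):
    """One Kalman filter predict + update step as a deterministic
    computation graph. Every operation is differentiable, so gradients
    of a downstream loss flow back into y (the network's intermediate
    observation) and into the model matrices."""
    # Predict through the linear dynamics.
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update with the network-produced observation y.
    S = C @ Sigma_pred @ C.T + R          # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new
```

Unrolling `kf_step` over a sequence yields a recurrent computation graph whose hidden state is the pair (mu, Sigma).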
However, the architecture of this neural network is informed by principled and well-motivated probabilistic filtering models, which provides a natural avenue for incorporating domain knowledge into the system.

Our experimental results indicate that end-to-end training of discriminative state estimators can improve their performance substantially compared to a standard piecewise approach, where a discriminative model is trained to process the raw observations and produce intermediate low-dimensional observations that can then be integrated into a standard generative filter. The results also indicate that, although the accuracy of the BKF can be matched by a recurrent LSTM network with a large number of hidden units, the BKF outperforms the general-purpose LSTM when the dataset is limited in size. This is because the BKF incorporates domain knowledge about the structure of probabilistic filters into the network architecture, providing it with a better inductive bias when the training data is limited, which is the case in many real-world robotic applications.

In our experiments, we primarily focused on models based on the Kalman filter. However, our approach to state estimation can equally well be applied to other probabilistic filters for which the update equations (approximate or exact) can be written in closed form, including the information filter, the unscented Kalman filter, and the particle filter, as well as deterministic filters such as state observers or moving-average processes. As long as the filter can be expressed as a differentiable mapping from the observation and previous state to the new state, we can construct and differentiate the corresponding computation graph. An interesting direction for future work is to extend discriminative state estimators with complex nonlinear dynamics and larger latent states.
For example, one could explore the continuum of models that span the space between simple KF-style state estimators and fully general recurrent networks. The trade-off between these two extremes is between generality and domain knowledge, and striking the right balance for a given problem could produce substantially improved results even with relatively modest amounts of training data.

Acknowledgments

This research was funded in part by ONR through a Young Investigator Program award, by the Army Research Office through the MAST program, and by the Berkeley DeepDrive Center.