{"title": "Adaptive Path-Integral Autoencoders: Representation Learning and Planning for Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 8927, "page_last": 8938, "abstract": "We present a representation learning algorithm that learns a low-dimensional latent dynamical system from high-dimensional sequential raw data, e.g., video. The framework builds upon recent advances in amortized inference methods that use both an inference network and a refinement procedure to output samples from a variational distribution given an observation sequence, and takes advantage of the duality between control and inference to approximately solve the intractable inference problem using the path integral control approach. The learned dynamical model can be used to predict and plan the future states; we also present the efficient planning method that exploits the learned low-dimensional latent dynamics. Numerical experiments show that the proposed path-integral control based variational inference method leads to tighter lower bounds in statistical model learning of sequential data. Supplementary video: https://youtu.be/xCp35crUoLQ", "full_text": "Adaptive Path-Integral Autoencoder: Representation\n\nLearning and Planning for Dynamical Systems\n\nJung-Su Ha, Young-Jin Park, Hyeok-Joo Chae, Soon-Seo Park, and Han-Lim Choi\n\nDepartment of Aerospace Engineering & KI for Robotics, KAIST\n\n{{jsha, yjpark, hjchae, sspark}@lics., hanlimc@}kaist.ac.kr\n\nDaejeon 305-701, Republic of Korea\n\nAbstract\n\nWe present a representation learning algorithm that learns a low-dimensional latent\ndynamical system from high-dimensional sequential raw data, e.g., video. 
The framework builds upon recent advances in amortized inference methods that use both an inference network and a refinement procedure to output samples from a variational distribution given an observation sequence, and takes advantage of the duality between control and inference to approximately solve the intractable inference problem using the path integral control approach. The learned dynamical model can be used to predict and plan the future states; we also present an efficient planning method that exploits the learned low-dimensional latent dynamics. Numerical experiments show that the proposed path-integral-control-based variational inference method leads to tighter lower bounds in statistical model learning of sequential data. The supplementary video1 and the implementation code2 are available online.

1 Introduction

Unsupervised learning of the underlying dynamics of sequential high-dimensional sensory inputs is essential to intelligence, because an agent must utilize the learned dynamical model to predict and plan future states. Such learning problems are formulated as latent (generative) model learning under the assumption that the observations emerge from low-dimensional latent states; this formulation involves an intractable posterior inference over the latent states for given input data. In the amortized variational inference framework, an inference network is introduced to output the variational parameters of an approximate posterior distribution. This allows for a fast approximate inference procedure and efficient end-to-end training of the generative and inference networks, where the learning signals from a loss function are back-propagated into the inference network via the reparameterization trick [Kingma and Welling, 2014, Rezende et al., 2014].
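As a minimal illustration (not the paper's implementation), the reparameterization trick for a diagonal Gaussian variational distribution can be sketched as follows; the parameter values below are hypothetical stand-ins for inference-network outputs:

```python
import numpy as np

def reparameterize(mu, log_var, eps):
    """z = g_phi(x, eps): a deterministic map, differentiable in (mu, log_var),
    with all randomness isolated in eps ~ N(0, I) from a known base distribution."""
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])        # hypothetical mean from an inference network
log_var = np.array([0.0, -1.0])   # hypothetical log-variance
z = reparameterize(mu, log_var, rng.standard_normal(2))  # one posterior sample
```

Because the sample is a deterministic function of the variational parameters, gradients of a Monte Carlo loss estimate can flow back into them.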
The learning procedure is based on optimizing a surrogate loss, a lower bound of the data likelihood, which results in two sources of sub-optimality: an approximation gap and an amortization gap [Krishnan et al., 2018, Cremer et al., 2018]. The former comes from the sub-optimality of the variational approximation (the gap between the true posterior and the optimal variational distribution), while the latter is caused by the amortized approximation (the gap between the optimal variational distribution and the distribution produced by the inference network). Recently, several works, e.g., [Hjelm et al., 2016, Krishnan et al., 2018, Kim et al., 2018], combined iterative refinement procedures with amortized inference, where the output distribution of the inference network is used as a warm start for refinement. This technique is referred to as semi-amortized inference; since the refined variational distributions do not rely solely on the inference network, the sub-optimality from the amortization gap can be mitigated.

1https://youtu.be/xCp35crUoLQ
2https://github.com/yjparkLiCS/18-NeurIPS-APIAE

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

For sequential data modeling, the generative model should be treated as a dynamical system, and a more sophisticated (approximate) inference method is required. Under the assumption that the underlying dynamics has the Markov property, a state space model can be introduced, which allows the inference network to be structured so as to mimic the factorized form of the true posterior distribution [Krishnan et al., 2017, Karl et al., 2017, Fraccaro et al., 2017]. Efficient end-to-end training with amortized inference is also possible here, where the inference network outputs the variational distribution over latent state trajectories for given observation sequences.
Even when the inference network is structured, the amortization gap inevitably increases because the inference must be performed in the trajectory space.

In this work, we present a semi-amortized variational inference method that operates in the trajectory space. For a generative model given by a state space model, an initial state distribution and control inputs serve as the parameters of variational distributions; the inference network is trained to output these variational parameters such that the corresponding latent trajectory describes the observation sequence well. In this formulation, the divergence between the prior and the variational distribution is naturally derived from stochastic calculus, and the inference problem can then be converted into a stochastic optimal control (SOC) problem, i.e., the so-called control-inference duality [Todorov, 2008, Ruiz and Kappen, 2017]. In the SOC view, the inference network approximates the optimal control policy; this can hardly be expected to succeed in a single shot, since SOC problems are hard to solve at once and iterative methods are generally used to solve them [Todorov, 2008, Tamar et al., 2016, Okada et al., 2017]. Thus, we adopt the adaptive path-integral control method to iteratively refine the variational parameters. We show that efficient end-to-end training is possible because samples from the refined variational distribution build a tighter lower bound and all the refinement procedures are differentiable. Moreover, because the proposed framework is based on an SOC method, the same structure can be utilized to plan future observation sequences, where the learned low-dimensional stochastic dynamics is used to explore the high-dimensional observation space efficiently.

2 Background

2.1 Statistical Modeling of Sequential Observations

Suppose that we have a set of observation sequences {x^{(i)}_{1:K}}_{i=1,...,I}, where x^{(i)}_{1:K} ≡ {x_k; ∀k = 1, ..., K}^{(i)} are i.i.d.
sequences of observations that lie on a (possibly high-dimensional) data space, X ⊂ R^{d_x}. The problem of interest is to build a probabilistic model that explains the given observations well. If the model is parameterized by θ, the problem is formulated as a maximum likelihood estimation (MLE) problem:

θ* = argmax_θ Σ_i log p_θ(x^{(i)}_{1:K}).   (1)

In this work, the observations are assumed to emerge from a latent dynamical system, where a latent state trajectory, z_{[0,T]} ≡ {z(t); ∀t ∈ [0, T]}, lies on a (possibly low-dimensional) latent space, Z ⊂ R^{d_z}:

p_θ(x_{1:K}) = ∫ p_θ(x_{1:K}|z_{[0,T]}) dp_θ(z_{[0,T]}),   (2)

where p_θ(x_{1:K}|z_{[0,T]}) and p_θ(z_{[0,T]}) are called the conditional likelihood and the prior distribution, respectively.3 In particular, we consider the state space model where the latent states are governed by a continuous-time stochastic differential equation (SDE), i.e., the prior p_θ(z_{[0,T]}) is the probability measure of the following system:

dz(t) = f(z(t))dt + σ(z(t))dw(t),  z(0) ∼ p_0(·),   (3)

where w(t) is a d_u-dimensional Wiener process.
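A discrete-time simulation of an SDE of the form (3) can be sketched with the Euler–Maruyama scheme; the drift and diffusion below are simple stand-ins, not the learned networks:

```python
import numpy as np

def euler_maruyama(f, sigma, z0, dt, n_steps, rng):
    """Simulate dz = f(z)dt + sigma(z)dw; each Wiener increment dw ~ N(0, dt*I)."""
    z = np.array(z0, dtype=float)
    path = [z.copy()]
    for _ in range(n_steps):
        dw = np.sqrt(dt) * rng.standard_normal(z.shape)
        z = z + f(z) * dt + sigma(z) * dw
        path.append(z.copy())
    return np.stack(path)

# Stand-in drift/diffusion: a stable linear system with a constant noise scale.
rng = np.random.default_rng(0)
path = euler_maruyama(lambda z: -z, lambda z: 0.1, np.ones(2), dt=0.1, n_steps=100, rng=rng)
```

Setting the diffusion to zero recovers a deterministic Euler rollout, which is the sense in which the prior mean dynamics can be simulated.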
Additionally, the conditional likelihood of the sequential observations is assumed to factorize along the time axis:

p_θ(x_{1:K}|z_{[0,T]}) = Π_{k=1}^{K} p_θ(x_k|z(t_k)),   (4)

where {t_k} is a sequence of discrete time points with t_1 = 0, t_K = T.

3Because each observation trajectory can be considered independently, we leave the trajectory index, i, out and restrict our discussion to one trajectory for the sake of notational simplicity.

2.2 Amortized Variational Inference and Multi-Sample Objectives

The objective function (1) cannot be optimized directly because it contains an intractable integration. To circumvent the intractable inference, a variational distribution q(·) is introduced, and a surrogate loss function L(q, θ; x), called the evidence lower bound (ELBO), is considered instead:

log p_θ(x) = log ∫ p_θ(x|z) p_θ(z) dz ≥ E_{q(z)}[ log ( p_θ(x|z) p_θ(z) / q(z) ) ] ≡ L(q, θ; x),   (5)

where q(·) can be any probability distribution over Z whose support includes that of p_θ(·). The gap between the log-likelihood and the ELBO is the Kullback–Leibler (KL) divergence between q(z) and the posterior p_θ(z|x):

log p_θ(x) − L(q, θ; x) = D_KL(q(z) || p_θ(z|x)).   (6)

In particular, the amortized variational inference approach introduces a conditional variational distribution, z ∼ q_φ(·|x), to approximate the intractable posterior distribution.
The variational distribution q_φ(·|x), which is referred to as the inference network, is parameterized by φ, so θ and φ can be simultaneously updated with ∇_{(θ,φ)} L(q_φ, θ; x) using stochastic gradient ascent. Variational autoencoders (VAEs) [Kingma and Welling, 2014, Rezende et al., 2014] make q_φ(·|x) a reparameterizable distribution, where z = g_φ(x, ε) is a differentiable deterministic function of an observation x and of ε ∼ d(·) sampled from a known base distribution d(·). Then, the gradient can be estimated as ∇_{(θ,φ)} L(q_φ, θ; x) = E_{d(ε)}[ ∇_{(θ,φ)} log ( p_θ(x, g_φ(x, ε)) / q_φ(g_φ(x, ε)|x) ) ], which generally yields a low-variance estimator.

A tighter lower bound is achieved by using multiple samples, z_{1:L}, independently sampled from q_φ:

L_L ≡ E_{z_{1:L} ∼ q_φ(·|x)}[ log (1/L) Σ_{l=1}^{L} p_θ(x, z_l) / q_φ(z_l|x) ].   (7)

It is proven that, as L increases, the bounds get tighter, i.e., log p_θ(x) ≥ ··· ≥ L_{L+1} ≥ L_L ≥ ···, and the gap eventually vanishes [Burda et al., 2016, Cremer et al., 2017]. This multi-sample objective (7) is in the class of Monte Carlo objectives (MCOs) in the sense that it utilizes independent samples to estimate the marginal likelihood [Mnih and Rezende, 2016]: p̂_θ(x) = (1/L) Σ_{l=1}^{L} p_θ(x, z_l)/q_φ(z_l|x), z_l ∼ q_φ(·|x).
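To make the estimator concrete, the multi-sample bound and the self-normalized importance weights can be computed from per-sample log-densities; this is a numerically stabilized sketch, with the log-densities assumed to come from the model and inference network:

```python
import numpy as np

def iwae_bound(log_p_joint, log_q):
    """log( (1/L) * sum_l exp(log p(x, z_l) - log q(z_l|x)) ) via log-sum-exp."""
    log_w = np.asarray(log_p_joint) - np.asarray(log_q)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

def normalized_weights(log_p_joint, log_q):
    """Self-normalized importance weights used to average per-sample gradients."""
    log_w = np.asarray(log_p_joint) - np.asarray(log_q)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```

When the proposal matches the true posterior, every log-weight is equal and the normalized weights collapse to 1/L, consistent with the zero-variance case discussed below.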
Defining w_{θ,φ}(x, z_l) ≡ p_θ(x, z_l)/q_φ(z_l|x) and w̃_l ≡ w_{θ,φ}(x, z_l) / Σ_i w_{θ,φ}(x, z_i), the gradient of (7) is given by:

∇_{(θ,φ)} L_L = E_{ε_{1:L} ∼ d(·)}[ Σ_{l=1}^{L} w̃_l ∇_{(θ,φ)} log w_{θ,φ}(x, g_φ(x, ε_l)) ].   (8)

Since the parameter update is averaged over multiple samples with the weights w̃_l, the above procedure is referred to as importance weighted autoencoders (IWAEs) [Burda et al., 2016]. The performance of IWAE training crucially depends on the variance of the importance weights w̃ (or, equivalently, on the effective sample size), which can be reduced by (i) increasing the number of samples and (ii) decreasing the gap between the proposal and the true posterior distribution; when the proposal q_φ(·|x) is equal to the true posterior p_θ(·|x), the variance reduces to 0, i.e., w̃_l = 1/L.

2.3 Semi-Amortized Variational Inference with Iterative Refinement

As mentioned previously, the performance of generative model learning depends on the gap between the variational and the posterior distributions. The amortized inference has two sources of this gap: the approximation and amortization gaps [Krishnan et al., 2018, Cremer et al., 2018]. The approximation gap arises from using the variational distribution to approximate the posterior distribution, and is given by the KL-divergence between the posterior distribution and the optimal variational distribution. The amortization gap is caused by the limited expressive power of inference networks, whose variational parameters are not individually optimized for each observation but amortized over all observations.
To address the issue of the amortization gap, a hybrid approach can be considered: for each observation, the variational distribution is refined individually, starting from the output of the inference network. Compared to amortized variational inference, this hybrid approach, coined semi-amortized variational inference, allows better variational parameters to be utilized in model learning.

3 Path Integral Adaptation for Variational Inference

3.1 Controlled SDE as variational distribution and structured inference network

When handling sequential observations, the variational distribution family should be carefully chosen so as to efficiently handle the increasing dimension of the variables along the time axis. In this work, the variational proposal distribution is given by the trajectory distribution of a controlled stochastic dynamical system, where the controls, u ∈ R^{d_u}, and the parameters of an initial state distribution, q_0, serve as variational parameters, i.e., the proposal q_u(z_{[0,T]}) is the probability measure of the following system:

dz(t) = f(z(t))dt + σ(z(t))(u(t)dt + dw(t)),  z(0) ∼ q_0(·).   (9)

By applying Girsanov's theorem (Appendix A), which provides the likelihood ratio between p(z_{[0,T]}) and q_u(z_{[0,T]}), the ELBO is written as:

L = E_{q_u(z_{[0,T]})}[ log p_θ(x_{1:K}|z_{[0,T]}) + log ( p_0(z(0)) / q_0(z(0)) ) − (1/2) ∫_0^T ‖u(t)‖² dt − ∫_0^T u(t)^T dw(t) ].   (10)

Then, the problem of finding the optimal variational parameters u* and q_0* (or, equivalently, the best approximate posterior) can be formulated as a SOC problem:

u*, q_0* = argmin_{u, q_0} E_{q_u(z_{[0,T]})}[ V(z_{[0,T]}) + (1/2) ∫_0^T ‖u(t)‖² dt + ∫_0^T u(t)^T dw(t) ],   (SOC)

where V(z_{[0,T]}) ≡ − log
( p_0(z(0)) / q_0(z(0)) ) − Σ_{k=1}^{K} log p_θ(x_k|z(t_k)) serves as the state cost of the SOC problem.

Suppose that the control policy is discretized along the time axis with the control parameters {u^{ff}_k, K_k}_{k=1,...,K−1}, as u(t, z(t)) = u^{ff}_k − K_k z(t), ∀t ∈ [t_k, t_{k+1}), and that the initial distribution is modeled as a Gaussian, q_0(·) = N(·; μ̂_0, Σ̂_0). Once the inference problem is converted into the SOC problem, the principle of optimality [Bellman, 2013] provides a sophisticated and efficient structure for inference networks. Note that, by the principle of optimality, the optimal initial state distribution depends on the cost over the whole horizon [0, T], but the optimal control policy at time t relies only on the future cost in (t, T]. Such a structure can be implemented using a backward recurrent neural network (RNN) that outputs the approximate optimal control policy: while the hidden states of the backward RNN compress the information of a given observation sequence backward in time, the hidden state at each time step, k = K−1, ..., 2, outputs the control policy parameters, {u^{ff}_k, K_k}. Finally, the first hidden state additionally outputs the initial distribution parameters, {μ̂_0, Σ̂_0, u^{ff}_1, K_1}. For detailed descriptions and illustrations, see Fig. 3(a) and Algorithm 2 in Appendix C.

3.2 Adaptive Path-Integral Autoencoder

(SOC) is in a class of linearly-solvable optimal control problems [Todorov, 2009], whose objective function can be written in a KL-divergence form:

J = D_KL( q_u(z_{[0,T]}) || p*(z_{[0,T]}) ) − log ξ,   (11)

where p*, given by dp*(z_{[0,T]}) = exp(−V(z_{[0,T]})) dp_θ(z_{[0,T]})/ξ, is the probability measure induced by optimally-controlled trajectories, and ξ ≡ ∫ exp(−V(z_{[0,T]})) dp_θ(z_{[0,T]}) is a normalization constant (see Appendix A for details).
By applying Girsanov's theorem again, the optimal trajectory distribution is expressed as:

dp*(z_{[0,T]}) ∝ dq_u(z_{[0,T]}) exp( −S_u(z_{[0,T]}) ),   (12)

S_u(z_{[0,T]}) = V(z_{[0,T]}) + (1/2) ∫_0^T ‖u(t)‖² dt + ∫_0^T u(t)^T dw(t).   (13)

This implies that the optimal trajectory distribution can be approximated by sampling a set of trajectories according to the controlled dynamics with u(t), i.e., z^l_{[0,T]} ∼ q_u(·), and assigning their importance weights as w̃_l = exp(−S_u(z^l_{[0,T]})) / Σ_i exp(−S_u(z^i_{[0,T]})), ∀l ∈ {1, ..., L}. As in the MCO case, the variance of the importance weights decreases as the control input u(·) gets closer to the true optimal control input u*(·), and it reduces to 0 when u(t) = u*(t, z(t)) [Thijssen and Kappen, 2015].

Path-integral control is a sampling-based SOC method, which approximates the optimal trajectory distribution, p̂*, with weighted sample trajectories using (12)–(13) and updates the control parameters based on moment matching of q_u to p̂*. Suppose that p̂* is approximated with sample trajectories and their weights, {z^l_{[0,T]}, w̃_l}_{l=1,...,L}, as above, and let u^{ff}(t) and K(t) represent the feedforward control and feedback gain, respectively. This work considers a standardized linear feedback controller to regularize the first and second moments of the trajectory distributions, where the control input has the form:

u(t) = u^{ff}(t) + K(t) Σ^{−1/2}(t) (z(t) − μ(t)),   (14)

where μ(t) = Σ_{l=1}^{L} w̃_l z^l(t) and Σ(t) = Σ_{l=1}^{L} w̃_l (z^l(t) − μ(t))(z^l(t) − μ(t))^T are the mean and covariance of the state w.r.t. p̂*, respectively.
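The weight computation in (12)–(13) and the weighted moments μ(t), Σ(t) can be sketched numerically as follows (an illustrative sketch; `Z` holds the L sampled states at one time instant):

```python
import numpy as np

def path_integral_weights(S):
    """w_tilde_l = exp(-S_u(z^l)) / sum_i exp(-S_u(z^i)): a softmax over
    negative path costs, shifted by min(S) for numerical stability."""
    S = np.asarray(S, dtype=float)
    w = np.exp(-(S - S.min()))
    return w / w.sum()

def weighted_moments(Z, w):
    """Weighted mean and covariance of the L sampled states (rows of Z)
    under the approximated optimal trajectory distribution."""
    mu = w @ Z
    d = Z - mu
    return mu, (w[:, None] * d).T @ d
```

Trajectories with lower path cost receive larger weights, so the weighted moments are pulled toward the low-cost region that the refined proposal should cover.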
Suppose a new set of trajectories and their weights is obtained by a (previous) control policy u(t) = ū^{ff}(t) + K̄(t) Σ̄^{−1/2}(t) (z(t) − μ̄(t)). Then, the path integral control theorem in Appendix B gives the update rules:

u^{ff}(t)dt = ū^{ff}(t)dt + K̄(t) Σ̄^{−1/2}(t) (μ(t) − μ̄(t)) dt + η Σ_{l=1}^{L} w̃_l dw^l(t),   (15)

K(t)dt = K̄(t) Σ̄^{−1/2}(t) Σ^{1/2}(t) dt + η Σ_{l=1}^{L} w̃_l dw^l(t) ( Σ^{−1/2}(t) (z^l(t) − μ(t)) )^T,   (16)

with the adaptation rate η. The initial state distribution can also be updated into q_0(·) = N(·; μ̂_0, Σ̂_0):

μ̂_0 = Σ_{l=1}^{L} w̃_l z^l(0),  Σ̂_0 = Σ_{l=1}^{L} w̃_l (z^l(0) − μ̂_0)(z^l(0) − μ̂_0)^T.   (17)

Starting from the variational parameters, {μ̂_0, Σ̂_0, u^{ff}_{1:K−1}, K_{1:K−1}}, given by the inference network, and from μ̄(t) = 0, Σ̄(t) = I, the update rules in (15)–(17) gradually refine the parameters of q_u so that the resulting trajectory distribution becomes close to the posterior distribution. After R adaptations, the MCO and its gradient are estimated by:

L̂_L = log (1/L) Σ_{l=1}^{L} exp(−S_u(z^l_{[0,T]})),  ∇_{θ,φ} L̂_L = −Σ_{l=1}^{L} w̃_l ∇_{θ,φ} S_u(z^l_{[0,T]}),   (18)

where θ and φ denote the parameters of the generative model, i.e., f(z), σ(z), p_0(z) and p(x|z), and of the inference network, i.e., the backward RNN, respectively. Because all procedures in the path integral adaptation and MCO construction are differentiable, they can be implemented as a fully differentiable network with R recurrences, which we name the Adaptive Path Integral Autoencoder (APIAE); see also Fig.
3(b) in Appendix C.

Note that the inference, reconstruction, and gradient backpropagation of APIAE can operate independently for each of the L samples. Consequently, the computational cost grows linearly with the number of samples, L, and the number of adaptations, R. As in the IWAE implementation [Burda et al., 2016], we replicated each observation L times and parallelized the whole operation on a GPU. We implemented APIAE with TensorFlow [Abadi et al., 2016]; the pseudocode and algorithmic details of APIAE are given in Appendix C.

4 High-dimensional Motion Planning with Learned Latent Model

High-dimensional motion planning is a challenging problem because of the curse of dimensionality: the size of the configuration space increases exponentially with the number of dimensions. However, as in latent variable model learning, it may be reasonable to assume that the configurations a planning algorithm really needs to consider form some sort of low-dimensional manifold in the configuration space [Vernaza and Lee, 2012], and the learned generative model provides stochastic dynamics on that manifold. Once this low-dimensional representation is obtained, any motion planning algorithm can solve high-dimensional planning problems very efficiently by utilizing it to restrict the search space.

More formally, suppose that the initial configuration, x_1, and the corresponding latent state, z(0), are given, and that the cost function, C_k(x_k), encodes the task specifications of a planning problem, e.g., desirability/undesirability of certain configurations, a penalty for obstacle collision, etc.
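A cost of this kind can be sketched as follows; the terminal term mirrors the image-difference cost C_K = ||x_target − x_K||² used in the pendulum experiment, while the spherical obstacle, its radius, and the weights are hypothetical choices for illustration:

```python
import numpy as np

def plan_cost(xs, x_target, obstacle_center, obstacle_radius, w_goal=1.0, w_obs=10.0):
    """Illustrative planning cost sum_k C_k(x_k): a terminal goal term
    ||x_target - x_K||^2 plus a penalty for entering a spherical obstacle.
    xs has shape (K, d_x); weights and obstacle shape are hypothetical."""
    cost = w_goal * float(np.sum((xs[-1] - x_target) ** 2))   # terminal cost C_K
    d = np.linalg.norm(xs - obstacle_center, axis=1)          # distance per step
    cost += w_obs * float(np.sum(np.maximum(obstacle_radius - d, 0.0)))
    return cost
```

Any differentiable (or even black-box) per-step cost of this shape can be plugged into the path-integral adaptation as the state cost.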
Then, the planning problem can be converted into the problem of finding the optimal trajectory distribution, q_u, that minimizes the following objective function:

J(q_u) = E_{x_{1:K} ∼ p_θ(·|z_{[0,T]}), z_{[0,T]} ∼ q_u(·)}[ Σ_{k=1}^{K} C_k(x_k) + D_KL( q_u(z_{[0,T]}) || p_θ(z_{[0,T]}) ) ].   (19)

That is, we want to find parameters, u, of the trajectory distribution that is not only likely to generate sample configuration sequences achieving low planning cost, but also does not deviate much from the (learned) prior, p_θ(z_{[0,T]}). The solution can be found using the aforementioned adaptive path integral control method, with the state cost function set to V(z_{[0,T]}) ≡ E_{p_θ(x_{1:K}|z_{[0,T]})}[ Σ_{k=1}^{K} C_k(x_k) ] and with the initial state distribution held fixed during the adaptation process. After the adaptations with this state cost function, the resulting plan can simply be sampled from the generative model, e.g., x_{1:K} ∼ p_θ(·|μ_{[0,T]}). Note that the time interval t_k − t_{k−1} and the trajectory length K can differ between the training and planning phases because continuous-time dynamics is dealt with.

5 Related Work

To address the complexity arising from the temporal structure of data, several approaches that build a sophisticated approximate inference model have been proposed. For example, Karl et al. [2017] used locally linear latent dynamics by introducing transition parameters, where the inference model infers the transition parameters rather than the latent states from the local transitions. Johnson et al. [2016] combined a structured graphical model in the latent space with a deep generative network, where an inference network produces local evidence potentials for the message passing algorithms. Fraccaro et al.
[2017] constructed two layers of latent models, where linear-Gaussian dynamical systems govern the two latent layers and the observation at each time step is related to the middle layer independently; the inference model in this framework consists of independent VAE inference networks at each time step and the Kalman smoothing algorithm along the time axis. Finally, the deep Kalman smoother (DKS) in [Krishnan et al., 2017] parameterized the dynamical system by a deep neural network and built an inference network with the same structure as the factorized posterior distribution. The idea of MCOs has also been used in the temporal setting: Maddison et al. [2017], Le et al. [2018], Naesseth et al. [2018] adapted the particle filter (PF) algorithm as their inference models and utilized a PF's estimator of the marginal likelihood as the training objective, which Maddison et al. [2017] named the filtering variational objectives (FIVOs).

These approaches can be viewed as attempts to reduce the approximation gap; by building the inference model in sophisticated ways that exploit the underlying structure of the data, the resulting variational family can flexibly approximate the posterior distribution. To overcome the amortization gap caused by inference networks, semi-amortized methods utilize an iterative refinement procedure to improve the variational distribution. Let q_φ and q* be the variational distributions from the inference network and from the refinement procedure, i.e., before and after the refinement, respectively. Hjelm et al. [2016] adopted adaptive importance sampling to refine the variational parameters, and the generative and inference networks are trained separately with ∇_θ L(q*, θ; x) and ∇_φ D_KL(q* || q_φ), respectively. Krishnan et al.
[2018] used stochastic variational inference as a refinement procedure, and the generative and inference networks are likewise trained separately with ∇_θ L(q*, θ; x) and ∇_φ L(q_φ, θ; x), respectively. Kim et al. [2018] also used stochastic variational inference, but proposed end-to-end training by allowing the learning signals to be backpropagated through the refinement procedure, and showed that this end-to-end training outperforms separate training.

This work presents a semi-amortized variational inference method for temporal data. In summary, we parameterize the variational distribution by control inputs and transform the approximate inference into a SOC problem. Our method utilizes a structured inference network based on the principle of optimality, which has a structure similar to the inference network of DKS [Krishnan et al., 2017]. The adaptive path-integral control method, which can be viewed as adaptive importance sampling in trajectory space [Kappen and Ruiz, 2016], is then adopted as the refinement procedure. Ruiz and Kappen [2017] also used the adaptive path integral approach to solve smoothing problems, and showed that the path integral-based smoothing method can outperform PF-based smoothing algorithms. Finally, by observing that all procedures of the path integral smoother are differentiable, the inference and generative networks are trained in an end-to-end manner. Note that APIAE is not the first
Note that APIAE is not the \ufb01rst\n\n6\n\n\falgorithm that implements an optimal planning/control algorithm into a fully-differentiable network.\nIn [Tamar et al., 2016, Okada et al., 2017, Karkus et al., 2017], similar iterative re\ufb01nement procedures\nwere built as differentiable networks to learn solutions of control problems in an end-to-end manner;\nthe fact that iterative methods were generally used to solve control problems can be a rationale for\nutilizing re\ufb01nement to approximate inference for sequential data.\nIn addition, there is a non-probabilistic branch of representation learning of dynamical systems, e.g.,\n[Watter et al., 2015, Banijamali et al., 2018, Jonschkowski and Brock, 2015, Lesort et al., 2018].\nThey basically stack two consecutive observations to contain the temporal information and learn\nthe dynamical model based on a carefully designed loss function considering the stacked data as\none observation. As shown in Appendix D, however, when the observations are highly-noisy (or\neven worse, when the system is unobservable with the stacked data), stacking a small number of\nobservations prohibits the training data from containing enough temporal information for learning\nrich generative models.\nLastly, there have been some recent works to utilize a low-dimensional latent model for motion\nplanning. Chen et al. [2016] exploited the idea of VAEs to embed dynamic movement primitives into\nthe latent space. In [Ha et al., 2018], Gaussian process dynamical models [Wang et al., 2008] served\nas a latent dynamical model and was utilized for planning in a similar way with this work. Though\nthe dynamics were not considered, Ichter et al. [2018], Zhang et al. 
[2018] used conditional VAEs to learn a non-uniform sampling methodology for a sampling-based motion planning algorithm.

6 Experiment

In our experiments, we would like to show that the proposed method is complementary to existing methods: APIAE can play a role in constructing a more expressive posterior distribution by refining the variational distribution produced by existing approximate inference methods. To support this statement, we built APIAEs upon the FIVO and IWAE frameworks and compared them with the models without adaptation procedures.

We set the APIAE parameters to L = 8, R = 4, and K = 10 throughout the experiments. Quantitative studies of the effect of varying these parameters are discussed in the appendix. The feedback gain is only used for planning, since the matrix inversion in (16) requires a Cholesky decomposition, which is often numerically unstable during training. We refer the reader to Appendix D and the supplementary video for more experimental details and results.

6.1 Dynamic Pendulum

The first experiment addresses system identification and planning for an inverted pendulum from raw images. The pendulum dynamics is represented by a second-order differential equation for the pendulum angle ψ: ψ'' = −9.8 sin(ψ) − ψ'. We simulated the pendulum dynamics by injecting disturbances from random initial states, and then generated sequences of 16×16 images corresponding to the pendulum state with the time interval δt = 0.1. This set of image sequences was the training data of APIAE, i.e., x_k lies in a 256-dimensional observation space. 3000 and 500 sequences are used for training and testing, respectively.

Fig. 1(a) shows the constructed 2-dimensional latent space; each point represents the posterior mean of an observation sequence, and it is shown that the angle and the angular velocity are well-encoded in the 2-dimensional space. As shown in Fig.
1(b), the learned dynamical model was able to successfully reconstruct the noisy observations and to predict and plan future images. For planning, the cost functions were set to penalize the difference between the last image of the generated sequence and a target image in Fig. 1(c), encoding planning problems for swing-up, -down, -left, and -right.

Figure 1: Pendulum results. (a) The inferred latent states colored by angles (top) and angular velocities (bottom) of the ground truth. (b) Resulting image sequences. From the top: images of ground truth, prediction, and four planning results for swing-up, -down, -left, and -right, respectively. Except for the first row, the images before the red line (k ≤ 10) are reconstructed ones. (c) The target images for each task: CK = ||xtarget − xK||^2.

6.2 Human Motion Capture Data

The second experiment addresses motion planning for a humanoid robot with a 62-dimensional configuration space. We utilized human motion capture data from the Carnegie Mellon University motion capture (CMU mocap) database; the training data were a set of short locomotion clips, e.g., standing, walking, and turning. The 62-dimensional configurations consist of the angles of all joints, the roll and pitch angles, the vertical position of the root, the yaw rate of the root, and the horizontal velocity of the root. The global (horizontal) position and heading orientation are not encoded in the generative model (only velocities are encoded), but they can be recovered by integration when an observation sequence is given. The original data were recorded at 120 Hz; we down-sampled them to 20 Hz and segmented them every 10 time steps, i.e., δt = 0.05, K = 10. 1043 and 173 clips were used for training and testing, respectively. We utilized the DeepMind Control Suite [Tassa et al., 2018] for parsing the data and visualizing the results.
Figs.
2(a-c) illustrate the posterior mean states of the training data colored by physical quantities of the ground truth; we can observe that (a) locomotion is embedded along the surface of a cylinder, while (b) the motions are arranged in order of yaw rate along the major axis of the cylinder and (c) motions with lower forward velocities are embedded into cycles of smaller radius. Also, Fig. 2(d) shows that APIAE successfully reconstructed the data. In the pendulum example, the Wiener process in the latent dynamics models disturbance to the system, and prediction can be made simply by ignoring the disturbance. In this example, by contrast, the framework uses the Wiener process to model the uncertainty in human decisions, e.g., whether to turn left or right, or to increase or decrease speed, similar to models of bounded rationality [Genewein et al., 2015] or maximum entropy IRL [Ziebart et al., 2008]. As shown in Fig. 2(e), from the very same initial pose, the framework predicts multiple future configurations, e.g., going straight or turning left or right (the ratio between motions eventually matches that of the training dataset), and these predictions play an essential role in planning. We then formulated planning problems in which the cost function penalized collision with an obstacle, large yaw rates, and distance from the goal. Figs. 2(f-g) show that the proposed method successfully generated natural, collision-free motion toward the goal.

6.3 Quantitative Results

One might expect that powerful inference methods based on resampling or refinement make the bound tighter, but achieving a tighter bound during learning does not directly imply better model learning [Rainforth et al., 2018]. To investigate this, we compared the lower bound and the reconstruction and prediction abilities of the models learned by the proposed and baseline algorithms.
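To make the notion of bound tightness concrete, the following sketch (not the APIAE implementation; a toy linear-Gaussian model in NumPy, chosen so that the exact marginal likelihood is available in closed form) estimates the importance-weighted lower bound of Burda et al. [2016] with K = 1 particle (the standard ELBO) and with K = 10 particles; averaging more weighted particles inside the logarithm provably tightens the bound toward log p(x):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-Gaussian model with a known exact marginal likelihood:
#   z ~ N(0, 1),  x | z ~ N(z, 1)  =>  marginally  x ~ N(0, 2).
def log_joint(x, z):
    # log p(x, z) = log N(z; 0, 1) + log N(x; z, 1)
    return -0.5 * z**2 - 0.5 * (x - z) ** 2 - np.log(2.0 * np.pi)

def log_proposal(z):
    # Deliberately crude proposal q(z | x) = N(0, 1); a refined q would tighten the bound.
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

def iwae_bound(x, K, n_rep=2000):
    # Average over n_rep independent estimates of log[(1/K) sum_k p(x, z_k) / q(z_k)].
    z = rng.standard_normal((n_rep, K))
    log_w = log_joint(x, z) - log_proposal(z)
    m = log_w.max(axis=1, keepdims=True)  # log-sum-exp for numerical stability
    log_mean_w = m.ravel() + np.log(np.exp(log_w - m).sum(axis=1)) - np.log(K)
    return log_mean_w.mean()

x = 1.5
log_px = -0.25 * x**2 - 0.5 * np.log(4.0 * np.pi)  # exact log N(x; 0, 2)
elbo = iwae_bound(x, K=1)
iwae10 = iwae_bound(x, K=10)
# Monotone tightening (up to Monte Carlo noise): elbo <= iwae10 <= log p(x).
```

With this intentionally crude proposal, the K = 1 bound is visibly looser than the K = 10 bound; the same tightening effect can instead be obtained by improving q itself, which is what a refinement procedure aims to do rather than adding particles.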
The results are reported in Table 1 (higher is better).4 Interestingly, we observe that learning with both the resampling and path-integral refinements resulted in the best reconstruction ability as well as the tightest bound, but the best prediction was achieved by the model learned with the refinements only. This implies that while powerful inference can lead to a tighter bound and good reconstruction, a bias in the gradients can prevent the resulting model from being accurate (note that the gradient components from the resampling are generally ignored because they cause high variance of the gradient estimator [Maddison et al., 2017, Le et al., 2018, Naesseth et al., 2018]). On the planning side, the prediction ability is crucial because the (learned) generative model needs to sample meaningful and diverse configuration sequences. We conclude that the resampling procedure is better utilized only for planning, not for learning; the same would hold in other application domains like 3-dimensional human motion tracking, where the prediction ability is more important.

4Mocap prediction is omitted because a proper measure for the prediction is unclear.

Figure 2: Mocap results. The learned latent space colored by (a) the gait phase, (b) yaw rate, and (c) forward velocity of the ground truth. We set the phase to 0 when the left foot touches the ground and to π when the right foot touches the ground. (d) Reconstruction. (e) Prediction results from the same initial poses. (f-g) Locomotion planning results.

Table 1: Comparison of the lower bound, reconstruction, and prediction. Each model was trained with (i) APIAE with resampling (+r), (ii) APIAE without resampling, (iii) FIVO, and (iv) IWAE.
The lower bounds are computed on the training datasets, while the reconstruction and prediction results are computed on the test datasets; the test datasets were around 1/6 the size of the training datasets.

                   Pendulum (×10^6)                      Mocap (×10^5)
           Lower-bound  Reconstruction  Prediction   Lower-bound  Reconstruction
APIAE+r      -9.866        -1.647         -1.985       -6.665        -1.158
APIAE        -9.927        -1.653         -1.845       -6.680        -1.171
FIVO         -9.890        -1.650         -1.978       -6.687        -1.167
IWAE         -9.974        -1.665         -1.860       -6.683        -1.174

7 Conclusion

In this paper, a semi-amortized variational inference method for sequential data was proposed. We parameterized a variational distribution by a control input and transformed approximate inference into a SOC problem. The proposed framework utilized a structured inference network based on the principle of optimality and adopted the adaptive path-integral control method as a refinement procedure. The experiments showed that the refinement procedure helped the learning algorithm achieve a tighter lower bound. It was also shown that a valid dynamical model can be identified from sequential raw data and utilized to plan future configurations.

Acknowledgments

This work was supported by the Agency for Defense Development under contract UD150047JD.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

Ershad Banijamali, Rui Shu, Mohammad Ghavamzadeh, Hung Bui, and Ali Ghodsi. Robust locally-linear controllable embedding. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Richard Bellman. Dynamic programming.
Courier Corporation, 2013.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations (ICLR), 2016.

Nutan Chen, Maximilian Karl, and Patrick van der Smagt. Dynamic movement primitives in latent space of time-dependent variational autoencoders. In International Conference on Humanoid Robots (Humanoids), pages 629–636. IEEE, 2016.

Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importance-weighted autoencoders. ICLR Workshop, 2017.

Chris Cremer, Xuechen Li, and David Duvenaud. Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558, 2018.

Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems (NIPS), pages 3604–3613, 2017.

Crispin W Gardiner et al. Handbook of stochastic methods, volume 4. Springer Berlin, 1985.

Tim Genewein, Felix Leibfried, Jordi Grau-Moya, and Daniel Alexander Braun. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.

Jung-Su Ha, Hyeok-Joo Chae, and Han-Lim Choi. Approximate inference-based motion planning by learning and exploiting low-dimensional latent variable models. In Robotics and Automation Letters (RA-L/IROS'18). IEEE, 2018.

Devon Hjelm, Ruslan R Salakhutdinov, Kyunghyun Cho, Nebojsa Jojic, Vince Calhoun, and Junyoung Chung. Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems (NIPS), pages 4691–4699, 2016.

Brian Ichter, James Harrison, and Marco Pavone. Learning sampling distributions for robot motion planning.
International Conference on Robotics and Automation (ICRA), 2018.

Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems (NIPS), pages 2946–2954, 2016.

Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015.

Hilbert Johan Kappen and Hans Christian Ruiz. Adaptive importance sampling for control and inference. Journal of Statistical Physics, 162(5):1244–1266, 2016.

Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems (NIPS), pages 4697–4707, 2017.

Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. International Conference on Learning Representations (ICLR), 2017.

Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.

Rahul G Krishnan, Dawen Liang, and Matthew Hoffman. On the challenges of learning with inference networks on sparse, high-dimensional data. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Tuan Anh Le, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. Auto-encoding sequential Monte Carlo.
International Conference on Learning Representations (ICLR), 2018.

Timothée Lesort, Natalia Díaz-Rodríguez, Jean-François Goudou, and David Filliat. State representation learning for control: An overview. arXiv preprint arXiv:1802.04181, 2018.

Chris J Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems (NIPS), 2017.

Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning (ICML), pages 2188–2196, 2016.

Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

Masashi Okada, Luca Rigazio, and Takenobu Aoshima. Path integral networks: End-to-end differentiable optimal control. arXiv preprint arXiv:1706.09597, 2017.

Tom Rainforth, Adam R Kosiorek, Tuan Anh Le, Chris J Maddison, Maximilian Igl, Frank Wood, and Yee Whye Teh. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning (ICML), 2018.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), pages 1278–1286, 2014.

Hans-Christian Ruiz and Hilbert J Kappen. Particle smoothing for hidden diffusion processes: Adaptive path integral smoother. IEEE Transactions on Signal Processing, 65(12):3191–3203, 2017.

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks.
In Advances in Neural Information Processing Systems (NIPS), pages 2154–2162, 2016.

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.

Sep Thijssen and HJ Kappen. Path integral control and state-dependent feedback. Physical Review E, 91(3):032104, 2015.

Emanuel Todorov. General duality between optimal control and estimation. In IEEE Conference on Decision and Control, pages 4286–4292. IEEE, 2008.

Emanuel Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478–11483, 2009.

Paul Vernaza and Daniel D Lee. Learning and exploiting low-dimensional structure for efficient holonomic motion planning in high-dimensional spaces. The International Journal of Robotics Research, 31(14):1739–1760, 2012.

Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.

Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems (NIPS), pages 2746–2754, 2015.

Clark Zhang, Jinwook Huh, and Daniel D Lee. Learning implicit sampling distributions for motion planning. arXiv preprint arXiv:1806.01968, 2018.

Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438.
Chicago, IL, USA, 2008.