{"title": "Softstar: Heuristic-Guided Probabilistic Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2764, "page_last": 2772, "abstract": "Recent machine learning methods for sequential behavior prediction estimate the motives of behavior rather than the behavior itself. This higher-level abstraction improves generalization in different prediction settings, but computing predictions often becomes intractable in large decision spaces. We propose the Softstar algorithm, a softened heuristic-guided search technique for the maximum entropy inverse optimal control model of sequential behavior. This approach supports probabilistic search with bounded approximation error at a significantly reduced computational cost when compared to sampling-based methods. We present the algorithm, analyze approximation guarantees, and compare performance with simulation-based inference on two distinct complex decision tasks.", "full_text": "Softstar: Heuristic-Guided Probabilistic Inference

Mathew Monfort, Computer Science Department, University of Illinois at Chicago, Chicago, IL 60607, mmonfo2@uic.edu
Brenden M. Lake, Center for Data Science, New York University, New York, NY 10003, brenden@nyu.edu
Brian D. Ziebart, Computer Science Department, University of Illinois at Chicago, Chicago, IL 60607, bziebart@uic.edu
Patrick Lucey, Disney Research Pittsburgh, Pittsburgh, PA 15232, patrick.lucey@disneyresearch.com
Joshua B. Tenenbaum, Brain and Cognitive Sciences Department, Massachusetts Institute of Technology, Cambridge, MA 02139, jbt@mit.edu

Abstract

Recent machine learning methods for sequential behavior prediction estimate the motives of behavior rather than the behavior itself. This higher-level abstraction improves generalization in different prediction settings, but computing predictions often becomes intractable in large decision spaces.
We propose the Softstar algorithm, a softened heuristic-guided search technique for the maximum entropy inverse optimal control model of sequential behavior. This approach supports probabilistic search with bounded approximation error at a significantly reduced computational cost when compared to sampling-based methods. We present the algorithm, analyze approximation guarantees, and compare performance with simulation-based inference on two distinct complex decision tasks.

1 Introduction

Inverse optimal control (IOC) [13], also known as inverse reinforcement learning [18, 1] and inverse planning [3], has become a powerful technique for learning to control or make decisions based on expert demonstrations [1, 20]. IOC estimates the utilities of a decision process that rationalizes an expert's demonstrated control sequences. Those estimated utilities can then be used in an (optimal) controller to solve new decision problems, producing behavior that is similar to demonstrations.

Predictive extensions to IOC [17, 23, 2, 16, 19, 6] recognize the inconsistencies, and inherent sub-optimality, of repeated behavior by incorporating uncertainty. They provide probabilistic forecasts of future decisions in which stochasticity is due to this uncertainty rather than the stochasticity of the decision process's dynamics. These models' distributions over plans and policies can typically be defined as softened versions of optimal sequential decision criteria.

A key challenge for predictive IOC is that many decision sequences are embedded within large decision processes. Symmetries in the decision process can be exploited to improve efficiency [21], but decision processes are not guaranteed to be (close to) symmetric. Approximation approaches to probabilistic structured prediction include approximate maxent IOC [12], heuristic-guided sampling [15], and graph-based IOC [7].
However, few guarantees are provided by these approaches; they are not complete, and the set of variable assignments uncovered may not be representative of the model's distribution.

Seeking to provide stronger guarantees and improve efficiency over previous methods, we present Softstar, a heuristic-guided probabilistic search algorithm for inverse optimal control. Our approach generalizes the A* search algorithm [8] to calculate distributions over decision sequences in predictive IOC settings, allowing for efficient bounded approximations of the near-optimal path distribution through a decision space. This distribution can then be used to update a set of trainable parameters, θ, that motivate the behavior of the decision process via a cost/reward function [18, 1, 3, 23]. We establish theoretical guarantees of this approach and demonstrate its effectiveness in two settings: learning stroke trajectories for Latin characters and modeling the ball-handling decision process of professional soccer.

2 Background

2.1 State-space graphs and heuristic-guided search

In this work, we restrict our consideration to deterministic planning tasks with discrete state spaces. The space of plans and their costs can be succinctly represented using a state-space graph, G = (S, E, cost), with vertices s ∈ S representing states of the planning task and directed edges e_{ab} ∈ E representing available transitions between states s_a and s_b. The neighbor set N(s) of state s is the set of states to which s has a directed edge, and the cost function cost(s, s′) represents the relative desirability of transitioning between states s and s′.

The optimal plan from state s_1 to goal state s_T is a variable-length sequence of states (s_1, s_2, ..., s_T) forming a path through the graph minimizing a cumulative penalty.
Letting h(s) represent the cost of the optimal path from state s to state s_T (i.e., the cost-to-go or value of s) and defining h(s_T) ≜ 0, the optimal path corresponds to a fixed-point solution of the next state selection criterion [5]:

h(s) = min_{s′ ∈ N(s)} [h(s′) + cost(s, s′)],    s_{t+1} = argmin_{s′ ∈ N(s_t)} [h(s′) + cost(s_t, s′)].    (1)

The optimal path distance to the start state, d(s), can be similarly defined (with d(s_1) ≜ 0) as

d(s) = min_{s′ : s ∈ N(s′)} [d(s′) + cost(s′, s)].    (2)

Dynamic programming algorithms, such as Dijkstra's [9], search the space of paths through the state-space graph in order of increasing d(s) to find the optimal path. Doing so implicitly considers all paths up to the length of the optimal path to the goal.

Additional knowledge can significantly reduce the portion of the state space that must be explored to obtain an optimal plan. For example, A* search [11] explores partial state sequences by expanding states that minimize an estimate, f(s) = d(s) + ĥ(s), combining the minimal cost to reach state s, d(s), with a heuristic estimate of the remaining cost-to-go, ĥ(s). A priority queue is used to keep track of expanded states and their respective estimates. A* search then expands the state at the top of the queue (lowest f(s)) and adds its neighboring states to the queue. When the heuristic estimate is admissible (i.e.,
ĥ(s) ≤ h(s) for all s ∈ S), the algorithm terminates with a guaranteed optimal solution once the best "unexpanded" state's estimate, f(s), is worse than the best discovered path to the goal.

2.2 Predictive inverse optimal control

Maximum entropy IOC algorithms [23, 22] estimate a stochastic action policy that is most uncertain while still guaranteeing the same expected cost as demonstrated behavior on an unknown cost function [1]. For planning settings with deterministic dynamics, this yields a probability distribution over state sequences that are consistent with paths through the state-space graph, P̂(s_{1:T}) ∝ e^{−cost_θ(s_{1:T})}, where cost_θ(s_{1:T}) = Σ_{t=1}^{T−1} θ^⊤ f(s_t, s_{t+1}) is a linearly weighted combination of the state-transition features computed by the feature function f(s_t, s_{t+1}) and a learned parameter vector θ. Calculating the marginal state probabilities of this distribution is important for estimating model parameters. The forward-backward algorithm [4] can be employed, but for large state spaces it may not be practical.

3 Approach

Motivated by the efficiency of heuristic-guided search algorithms for optimal planning, we define an analogous approximation task in the predictive inference setting and present an algorithm that leverages heuristic functions to accomplish this task efficiently with bounded approximation error. The problem being addressed is the inefficiency of existing inference methods for reward/cost-based probabilistic models of behavior. We present a method using ideas from heuristic-guided search (i.e., A*) for estimating path distributions through large-scale deterministic graphs with bounded approximation guarantees.
This is an improvement over previous methods, as it results in more accurate distribution estimates without the complexity and sub-optimality concerns of path sampling, and it is suitable for any problem that can be represented as such a graph.

Additionally, since the proposed method does not sample paths, but instead searches the space as in A*, it does not need to retrace its steps along a previously searched trajectory to find a new path to the goal. It instead creates a new branch from an already explored state, whereas sampling would require retracing an entire sequence until this branching state was reached. This yields efficiency improvements in addition to the improved distribution estimates.

3.1 Inference as softened planning

We begin our investigation by recasting the inference task from the perspective of softened planning, where the predictive IOC distribution over state sequences factors into a stochastic policy [23],

π(s_{t+1} | s_t) = e^{h_soft(s_t) − h_soft(s_{t+1}) − θ^⊤ f(s_t, s_{t+1})},    (3)

according to a softened cost-to-go recurrence, h_soft(s), that is a relaxation of the Bellman equation:

h_soft(s_t) = −log Σ_{s_{t:T} ∈ Ξ_{s_t, s_T}} e^{−cost_θ(s_{t:T})} = softmin_{s_{t+1} ∈ N(s_t)} { h_soft(s_{t+1}) + θ^⊤ f(s_t, s_{t+1}) },    (4)

where Ξ_{s_t, s_T} is the set of all paths from s_t to s_T; the softmin, softmin_x α(x) ≜ −log Σ_x e^{−α(x)}, is a smoothed relaxation of the min function¹; and the goal state value is initially 0 and ∞ for all other states. A similar softened minimum distance exists in the forward direction from the start state,

d_soft(s_t) = −log Σ_{s_{1:t} ∈ Ξ_{s_1, s_t}} e^{−cost_θ(s_{1:t})} = softmin_{s_{t−1} : s_t ∈ N(s_{t−1})} { d_soft(s_{t−1}) + θ^⊤ f(s_{t−1}, s_t) }.

By combining forward and backward soft distances, important marginal expectations are obtained and used to predict
state visitation probabilities and fit the maximum entropy IOC model's parameters [23]. Efficient search and learning require accurate estimates of d_soft and h_soft values, since the expected number of occurrences of the transition from s_a to s_b under the soft path distribution is:

e^{−d_soft(s_a) − h_soft(s_b) − θ^⊤ f(s_a, s_b) + d_soft(s_T)}.    (5)

These cost-to-go and distance functions can be computed in closed form using a geometric series,

B = A(I − A)^{−1} = A + A² + A³ + A⁴ + ···,    (6)

where A_{i,j} = e^{−cost(s_i, s_j)} for any states s_i, s_j ∈ S. The (i, j)th entry of B is related to the softmin of all the paths from s_i to s_j. Specifically, the softened cost-to-go can be written as h_soft(s_i) = −log b_{s_i, s_T}. Unfortunately, the required matrix inversion operation is computationally expensive, preventing its use in typical inverse optimal control applications. In fact, power iteration methods used for sparse matrix inversion closely resemble the softened Bellman updates of Equation (4) that have instead been used for IOC [22].

¹Equivalently, min_x α(x) + softmin_x { α(x) − min_x α(x) } is employed to avoid overflow/underflow in practice.

3.2 Challenges and approximation desiderata

In contrast with optimal control and planning tasks, the softened distance functions, d_soft(s), and cost-to-go functions, h_soft(s), in predictive IOC are based on many paths rather than a single (best) one. Thus, unlike in A* search, each sub-optimal path cannot simply be ignored; its influence must instead be incorporated into the softened distance calculation (4). This key distinction poses a significantly different objective for heuristic-guided probabilistic search: find a subset of paths for which the softmin distances closely approximate the softmin of the entire path set.
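As a concrete check of the closed-form geometric series (6) against the softened Bellman recurrence (4), the following sketch (our illustration, not the paper's code; the four-state graph and its edge costs are invented, and cost(s, s′) stands in for θ^⊤f(s, s′)) computes the softened cost-to-go both ways on a small acyclic graph:

```python
import numpy as np

inf = np.inf
# Tiny invented 4-state deterministic graph; state 3 is the goal.
# inf marks a missing edge.
cost = np.array([
    [inf, 1.0, 4.0, inf],
    [inf, inf, 1.0, 3.0],
    [inf, inf, inf, 1.0],
    [inf, inf, inf, inf],
])
A = np.exp(-cost)   # A[i, j] = e^{-cost(s_i, s_j)}; e^{-inf} = 0
n, goal = 4, 3

# Closed form (Eq. 6): B = A (I - A)^{-1} sums e^{-cost} over all paths,
# so h_soft(s_i) = -log B[i, goal].
B = A @ np.linalg.inv(np.eye(n) - A)
h_closed = -np.log(B[:, goal])

def softmin(values):
    """Numerically stable softmin: -log sum_i e^{-v_i}."""
    v = np.asarray(values, dtype=float)
    m = v.min()
    if np.isinf(m):
        return m
    return m - np.log(np.exp(m - v).sum())

# Softened Bellman backups (Eq. 4) from h_soft(goal) = 0; sweeping
# states in reverse topological order converges on this DAG.
h = np.full(n, inf)
h[goal] = 0.0
for _ in range(50):
    for s in range(n - 1, -1, -1):
        cand = [h[t] + cost[s, t] for t in range(n) if cost[s, t] < inf]
        if cand:
            h[s] = softmin(cand)

# Both computations agree on all non-goal states.
assert np.allclose(h[:goal], h_closed[:goal])
```

The start-to-goal paths here have costs 3, 4, and 5, so h_soft of the start state is −log(e⁻³ + e⁻⁴ + e⁻⁵), which both the matrix inversion and the iterative backups recover; the closed form only stays finite when the series converges (here A is nilpotent, so it trivially does).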
While we would hope that a small subset of paths exists that provides a close approximation, the cost function weights and the structure of the state-space graph ultimately determine whether this is the case. With this in mind, we construct a method guided by the following desiderata and analyze its guarantees:

1. Known bounds on approximation guarantees;
2. Convergence to any desired approximation guarantee; and
3. Efficient discovery of small approximation sets of paths.

3.3 Regimes of convergence

In A* search, theoretical results are based on the assumption that all infinite-length paths have infinite cost (i.e., any cycle has a positive cost) [11]. This avoids a negative-cost-cycle regime of non-convergence. Our predictive setting has a stronger requirement: there are three regimes of convergence for the predictive IOC distribution, characterized by:

1. An infinite-length most likely plan;
2. A finite-length most likely plan with expected infinite-length plans; and
3. A finite expected plan length.

The first regime results from the same situation described for optimal planning: reachable cycles of negative cost. The second regime arises when the number of paths grows faster than the penalization of the weights from the additional cost of longer paths (without negative cycles) and is non-convergent. The final regime is convergent.

An additional assumption is needed in the predictive IOC setting to avoid the second regime of non-convergence.
We assume that a fixed bound on the entropy of the distribution of paths, H(S_{1:T}) ≜ E[−log P(S_{1:T})] ≤ H_max, is known.

Theorem 1. Expected costs under the predictive IOC distribution are related to entropy and softmin path costs by E[cost_θ(S_{1:T})] = H(S_{1:T}) − d_soft(s_T).

Together, bounds on the entropy and the softmin distance function constrain expected costs under the predictive IOC distribution (Theorem 1).

3.4 Computing approximation error bounds

A* search with a non-monotonic heuristic function guarantees optimality when the priority queue's minimal element has an estimate d_soft(s) + ĥ_soft(s) exceeding the best start-to-goal path cost, d_soft(s_T). Though optimality is no longer guaranteed in the softmin search setting, approximations to the softmin distance are obtained by considering a subset of paths (Lemma 1).

Lemma 1. Let Ξ represent the entire set (potentially infinite in size) of paths from state s to s_T. We can partition Ξ into two sets Ξ_a and Ξ_b such that Ξ_a ∪ Ξ_b = Ξ and Ξ_a ∩ Ξ_b = ∅, and define d^Ξ_soft as the softmin over all paths in the set Ξ.
Then, given a lower-bound estimate for the distance, d̂_soft(s) ≤ d_soft(s), we have e^{−d^{Ξ}_soft(s)} − e^{−d^{Ξ_a}_soft(s)} ≤ e^{−d̂^{Ξ_b}_soft(s)}.

We establish a bound on the error introduced by considering only the paths through a set of states S≈ in the following theorem.

Theorem 2. Given an approximation state subset S≈ ⊆ S with neighbors of the approximation set defined as N(S≈) ≜ ∪_{s ∈ S≈} N(s) − S≈, the approximation loss of exact search for paths through this approximation set (i.e., paths with non-terminal vertices from S≈ and terminal vertices from S≈ ∪ N(S≈)) is bounded by the softmin of the set's neighbors' estimates:

e^{−d_soft(s_T)} − e^{−d^{S≈}_soft(s_T)} ≤ e^{−softmin_{s ∈ N(S≈)} { d^{S≈}_soft(s) + ĥ_soft(s) }},

where d^{S≈}_soft(s) is the softmin over all paths with terminal state s and all previous states within S≈.

Thus, for a dynamic construction of the approximation set S≈, a bound on approximation error can be maintained by tracking the weights of all states in the neighborhood of that set. In practice, even computing the exact softened distance function for paths through a small subset of states may be computationally impractical.
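The partition identity behind Lemma 1 is easy to check numerically. In this sketch (our illustration, with invented path costs and an invented lower bound), the probability mass e^{−d_soft} splits exactly across a partition of the path set, so any lower bound on the discarded set's soft distance bounds the omitted mass:

```python
import numpy as np

def softmin(costs):
    """Stable softmin: -log sum_i e^{-c_i}."""
    c = np.asarray(costs, dtype=float)
    m = c.min()
    return m - np.log(np.exp(m - c).sum())

# Hypothetical path costs from a state s to the goal, partitioned in two.
xi_a = [3.0, 4.0]         # retained paths (set Xi_a)
xi_b = [5.0, 6.0, 7.0]    # discarded paths (set Xi_b)

d_all = softmin(xi_a + xi_b)   # d^Xi_soft(s), over the full path set
d_a = softmin(xi_a)            # d^{Xi_a}_soft(s)
d_b = softmin(xi_b)            # d^{Xi_b}_soft(s)

# Exact identity: the soft-distance mass splits across the partition.
assert np.isclose(np.exp(-d_all), np.exp(-d_a) + np.exp(-d_b))

# Any lower bound d_hat <= d^{Xi_b}_soft(s) bounds the omitted mass,
# which is the inequality of Lemma 1 (4.5 <= d_b here).
d_hat = 4.5
assert np.exp(-d_all) - np.exp(-d_a) <= np.exp(-d_hat)
```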
Theorem 3 establishes the approximate search bounds when only a subset of the paths in the approximation set is employed to compute the soft distance.

Theorem 3. If a subset of paths Ξ′_{S≈} ⊆ Ξ_{S≈} through the approximation set S≈ is employed to compute the soft distance (and Ξ̄′_{S≈} ⊆ Ξ_{S≈} − Ξ′_{S≈} represents a set of paths that are prefixes for all of the remaining paths within S≈), the error of the resulting estimate is bounded by:

e^{−d_soft(s_T)} − e^{−d^{Ξ′_{S≈}}_soft(s_T)} ≤ e^{−softmin( softmin_{s ∈ N(S≈)} { d^{Ξ′_{S≈}}_soft(s) + ĥ_soft(s) }, softmin_{s ∈ S≈} { d^{Ξ̄′_{S≈}}_soft(s) + ĥ_soft(s) } )}.

3.5 Softstar: greedy forward path exploration and backward cost-to-go estimation

Our algorithm greedily expands nodes by considering the state contributing the most to the approximation bound (Theorem 3).
This is accomplished by extending A* search in the following algorithm.

Algorithm 1 Softstar: greedy forward and approximate backward search with fixed ordering
Input: state-space graph G, initial state s_1, goal s_T, heuristic ĥ_soft, and approximation bound ε
Output: approximate soft distance to the goal, h^{S≈}_soft
Set h_soft(s) = d_soft(s) = f_soft(s) = ∞ for all s ∈ S, h_soft(s_T) = 0, d_soft(s_1) = 0, and f_soft(s_1) = ĥ_soft(s_1)
Insert ⟨s_1, f_soft(s_1)⟩ into priority queue P and initialize empty stack O
while softmin_{s ∈ P}(f_soft(s)) + ε ≤ d_soft(s_T) do
    Set s ← min element popped from P
    Push s onto O
    for s′ ∈ N(s) do
        f_soft(s′) = softmin(f_soft(s′), d_soft(s) + cost(s, s′) + ĥ_soft(s′))
        d_soft(s′) = softmin(d_soft(s′), d_soft(s) + cost(s, s′))
        (Re-)insert ⟨s′, f_soft(s′)⟩ into P
    end
end
while O not empty do
    Set s ← top element popped from O
    for s′ ∈ N(s) do
        h_soft(s) = softmin(h_soft(s), h_soft(s′) + cost(s, s′))
    end
end
return h_soft

For insertions into the priority queue, if s′ already exists in the queue, its estimate is updated to the softmin of its previous estimate and the new insertion estimate. Additionally, the softmin of all of the estimates of elements on the queue can be dynamically updated as elements are added and removed. The queue contains some states that have never been explored and some that have. The former correspond to the neighbors of the approximation state set, and the latter correspond to the search approximation error within the approximation state set (Theorem 3). The softmin over all elements of the priority queue thus provides a bound on the approximation error of the returned distance measure.
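A compact sketch of Algorithm 1 follows. This is our simplification, not the authors' implementation: a dictionary stands in for the priority queue, the softmin over the frontier is recomputed each iteration rather than maintained incrementally, and the demo at the bottom (an invented four-state DAG) uses the trivial zero heuristic, which makes the forward pass degenerate to unguided soft search:

```python
import numpy as np

def softmin2(a, b):
    """Stable softmin of two values: -log(e^-a + e^-b)."""
    m = min(a, b)
    if np.isinf(m):
        return m
    return m - np.log(np.exp(m - a) + np.exp(m - b))

def softstar(neighbors, cost, h_hat, start, goal, eps=1e-9):
    """Greedy forward search, then a backward pass over the exploration
    order to estimate soft cost-to-go values h_soft."""
    inf = float("inf")
    d = {start: 0.0}              # soft distances d_soft
    f = {start: h_hat(start)}     # frontier estimates f_soft = d_soft + h_hat
    order = []                    # exploration stack O
    while f:
        # softmin over the queue bounds the unexplored probability mass
        frontier = -np.log(sum(np.exp(-v) for v in f.values()))
        if frontier + eps > d.get(goal, inf):
            break                 # approximation bound met
        s = min(f, key=f.get)     # pop the minimal element of P
        del f[s]
        order.append(s)
        for sp in neighbors.get(s, []):
            step = d[s] + cost(s, sp)
            f[sp] = softmin2(f.get(sp, inf), step + h_hat(sp))
            d[sp] = softmin2(d.get(sp, inf), step)
    # backward pass in reverse exploration order computes h_soft
    h = {goal: 0.0}
    for s in reversed(order):
        for sp in neighbors.get(s, []):
            h[s] = softmin2(h.get(s, inf), h.get(sp, inf) + cost(s, sp))
    return h, d

# Example on a small invented DAG (state 3 is the goal); the three
# start-to-goal paths have costs 3, 4, and 5.
edges = {0: {1: 1.0, 2: 4.0}, 1: {2: 1.0, 3: 3.0}, 2: {3: 1.0}}
h, d = softstar({s: list(e) for s, e in edges.items()},
                lambda s, sp: edges[s][sp],
                lambda s: 0.0, start=0, goal=3)
```

On this small graph the search exhausts the path set, so h[0] and d[3] both equal the exact softmin −log(e⁻³ + e⁻⁴ + e⁻⁵); on larger graphs the ε-termination test trades accuracy for expansions, as in Theorems 3 and 4.

```python
```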
The exploration order, O, is a stack containing the order in which each state is explored/expanded. A loop through the reverse of the node exploration ordering (stack O) generated by the forward search computes complementary backward cost-to-go values, h_soft. The expected number of occurrences of state transitions can then be calculated for the approximate distribution (5). The bound on the difference between the expected path cost of this approximate distribution and the actual distribution over the entire state set is established in Theorem 4.

Theorem 4. The cost expectation inaccuracy introduced by employing the state set S≈ is bounded by

|E[cost_θ(S_{1:T})] − E_{S≈}[cost_θ(S_{1:T})]| ≤ e^{d^{S≈}_soft(s_T) − softmin_{s ∈ P} f_soft(s)} |E_P[cost_θ(S_{1:T})] − E_{S≈}[cost_θ(S_{1:T})]|,

where E_{S≈} is the expectation under the approximate state set produced by the algorithm; softmin_{s ∈ P} f_soft(s) is the softmin of f_soft over all the states remaining on the priority queue after the first while loop of Algorithm 1; and E_P is the expectation over all paths not considered in the second while loop (i.e., remaining on the queue). E_P is unknown, but can be bounded using Theorem 1.

3.6 Completeness guarantee

The notion of monotonicity extends to the probabilistic setting, guaranteeing that the expansion of a state provides no looser bounds than the unexpanded state (Definition 1).

Definition 1. A heuristic function ĥ_soft is monotonic if and only if, for all s ∈ S, ĥ_soft(s) ≥ softmin_{s′ ∈ N(s)} { ĥ_soft(s′) + cost(s, s′) }.

Assuming this, the completeness of the proposed algorithm can be established (Theorem 5).

Theorem 5. For monotonic heuristic functions and finite softmin distances, convergence to any level of softmin approximation is guaranteed by Algorithm 1.

4 Experimental Validation

We demonstrate the effectiveness of our approach on datasets for Latin character construction using sequences of pen strokes and for the ball-handling decisions of professional soccer players. In both cases we learn the parameters of a state-action cost function that motivates the behavior in the demonstrated data, using the Softstar algorithm to estimate the state-action feature distributions needed to update the parameters of the cost function [23]. We refer to the appendix for more information.

We focus our experimental results on estimating state-action feature distributions through large state spaces for inverse optimal control, as there is substantial room for improvement over standard approaches, which typically use sampling-based methods that provide few (if any) approximation guarantees.
Softstar directly estimates this distribution with bounded approximation error, allowing for more accurate estimation and more informed parameter updates.

4.1 Comparison approaches

We compare our approach to heuristic-guided maximum entropy sampling [15], approximate maximum entropy sampling [12], reversible jump Markov chain Monte Carlo (MCMC) [10], and a search that is not guided by heuristics (comparable to Dijkstra's algorithm for planning). For consistency, we use the softmin distance to generate the values of each state in MCMC. Results were collected on an Intel i7-3720QM CPU at 2.60GHz.

4.2 Character drawing

We apply our approach to the task of predicting the sequential pen strokes used to draw characters from the Latin alphabet. The task is to learn the behavior of how a person draws a character given some nodal skeleton. Despite the apparent simplicity, applying standard IOC methods is challenging due to the large planning graph corresponding to a fine-grained representation of the task. We demonstrate the effectiveness of our method against other commonly employed techniques.

Demonstrated data: The data consists of a randomly separated training set of 400 drawn characters, each with a unique demonstrated trajectory, and a separate test set of 52 examples, where the handwritten characters are converted into skeletons of nodes within a unit character frame [14].
For example, a character with 16 nodes and 15 edges with has a\ncorresponding state space of about 9.47 million states.\nThe initial state has no nodal history and a bitmap with all uncovered edges. The goal state will have\na two node history as de\ufb01ned above, and a fully set bitmap representing all edges as covered. Any\ntransition between nodes is allowed, with transitions between neighbors de\ufb01ned as edge draws and\nall others as pen lifts. The appendix provides additional details on the feature representation.\n\nFigure 1: Character\nskeleton with two\npen strokes.\n\nHeuristic: We consider a heuristic function that combines the (soft) minimum costs of covering\neach remaining uncovered edge in a character assuming all moves that do not cross an uncovered\nedge have zero cost. Formally, it is expressed using the set of uncovered edges, Eu, and the set of\nall possible costs of traversing edge i, cost(ei), as \u02c6hsoft(s) =\n\n(cid:88)\n\nsoftmin\n\ncost(ei).\n\nei\u2208Eu\n\nei\n\n4.3 Professional Soccer\n\nIn addition, we apply our approach to the task of modeling the discrete spatial decision process of the\nball-handler for single possession open plays in professional soccer. As in the character drawing task,\nwe demonstrate the effectiveness of our approach against other commonly employed techniques.\n\nDemonstrated data: Tracking information from 93 games consisting of player locations and time\nsteps of signi\ufb01cant events/actions were pre-processed into sets of sequential actions in single pos-\nsessions. Each possession may include multiple different team-mates handling the ball at different\ntimes resulting in a team decision process on actions rather than single player actions/decisions.\nDiscretizing the soccer \ufb01eld into cells leads to a very large decision process when considering actions\nto each cell at each step. 
We increase generalization by reformatting the field coordinates so that the origin lies at the center of the team's goal, and all playing fields are normalized to 105m by 68m and discretized into 5m x 4m cells. Formatting the field coordinates based on distances from the goal of the team in possession doubles the amount of training data for similar coordinates. The positive and negative half-planes of the y-axis capture which side of the goal the ball is located on.

We train a spatial decision model on 92 of the games and evaluate the learned ball trajectories on a single test game. The data contains 20,337 training possession sequences and 230 test sequences.

State and feature representation: The state consists of a two-action history, where an action is designated as a type-cell tuple: the type is the action (pass, shot, clear, dribble, or cross) and the cell is the destination cell, with the most recent action containing the ball's current location.
There are 1433 possible actions at each step in a trajectory, resulting in about 2.05 million possible states. There are 28 Euclidean features for each action type and 29 that apply to all action types, resulting in 168 total features. We use the same features as the character drawing model and include a different set of features for each action type to learn unique action-based cost functions.

Heuristic: We use the softmin cost over all possible actions from the current state as a heuristic. It is admissible if the next state is assumed to always be the goal: ĥ_soft(s) = softmin_{s′ ∈ N(s)} { cost(s, s′) }.

4.4 Comparison of learning efficiency

We compare Softstar to other inference procedures for large-scale IOC and measure the average test set log-loss, equivalent to the difference between the cost of the demonstrated path and the softmin distance to the goal: −log P(path) = cost(s_{1:T}) − d_soft(goal).

Figure 2: Log-loss after each training epoch; training efficiency on the Character (left) and Soccer (right) domains.

Figure 2 shows the decrease of the test set log-loss after each training epoch. The proposed method learns the models far more efficiently than both approximate maxent IOC [12] and heuristic-guided sampling [15]. This is likely due to the more accurate estimation of the feature expectations that results from searching the graph rather than sampling trajectories.

The improved efficiency of the proposed method is also evident in the respective time taken to train each model. Softstar took ~5 hours to train 10 epochs for the character model and ~12 hours to train 25 epochs for the soccer model.
To compare, heuristic sampling took ~9 hours for the character model and ~17 hours for the soccer model, and approximate maxent took ~10 hours for the character model and ~20 hours for the soccer model.

4.5 Analysis of inference efficiency

In addition to evaluating learning efficiency, we compare the average time efficiency for generating lower bounds on the estimated softmin distance to the goal for each model in Figure 3.

Figure 3: Softmin distance estimation as a function of time; inference efficiency evaluations for the Character (left) and Soccer (right) domains.

The MCMC approach has trouble with local optima. While the unguided algorithm does not experience this problem, it instead explores a large number of improbable paths to the goal. The proposed method avoids low-probability paths and converges much faster than the comparison methods. MCMC fails to converge on both examples even after 1,200 seconds, matching past experience with the character data, where MCMC proved incapable of efficient inference.

5 Conclusions

In this work, we extended heuristic-guided search techniques for optimal planning to the predictive inverse optimal control setting. Probabilistic search in these settings is significantly more computationally demanding than A* search, both in theory and practice, primarily due to key differences between the min and softmin functions.
However, despite this, we found significant performance improvements compared to other IOC inference methods by employing heuristic-guided search ideas.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. #1227495, Purposeful Prediction: Co-robot Interaction via Understanding Intent and Goals.

References

[1] Peter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 1-8, 2004.

[2] Monica Babes, Vukosi Marivate, Kaushik Subramanian, and Michael L. Littman. Apprenticeship learning about multiple intentions. In International Conference on Machine Learning, 2011.

[3] Chris L. Baker, Joshua B. Tenenbaum, and Rebecca R. Saxe. Goal inference as inverse planning. In Conference of the Cognitive Science Society, 2007.

[4] Leonard E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1-8, 1972.

[5] Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6:679-684, 1957.

[6] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 182-189, 2011.

[7] Arunkumar Byravan, Mathew Monfort, Brian Ziebart, Byron Boots, and Dieter Fox.
Graph-based inverse optimal control for robot manipulation. In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.

[8] Rina Dechter and Judea Pearl. Generalized best-first search strategies and the optimality of A*. Journal of the ACM, July 1985.

[9] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1959.

[10] Peter J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82:711-732, 1995.

[11] Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4:100-107, 1968.

[12] De-An Huang, Amir-massoud Farahmand, Kris M. Kitani, and J. Andrew Bagnell. Approximate maxent inverse optimal control and its application for mental simulation of human interactions. In AAAI, 2015.

[13] Rudolf E. Kalman. When is a linear control system optimal? Trans. ASME, J. Basic Engrg., 86:51-60, 1964.

[14] Brenden M. Lake, Ruslan Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In NIPS, 2013.

[15] Mathew Monfort, Brenden M. Lake, Brian D. Ziebart, and Joshua B. Tenenbaum. Predictive inverse optimal control in large decision processes via heuristic-based search. In ICML Workshop on Robot Learning, 2013.

[16] Mathew Monfort, Anqi Liu, and Brian Ziebart. Intent prediction and trajectory forecasting via predictive inverse linear-quadratic regulation. In AAAI, 2015.

[17] Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of UAI, pages 295-302, 2007.

[18] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2000.

[19] Deepak Ramachandran and Eyal Amir.
Bayesian inverse reinforcement learning. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2586-2591, 2007.

[20] Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. Maximum margin planning. In Proceedings of the International Conference on Machine Learning, pages 729-736, 2006.

[21] Paul Vernaza and Drew Bagnell. Efficient high-dimensional maximum entropy modeling via symmetric partition functions. In Advances in Neural Information Processing Systems, pages 575-583, 2012.

[22] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In International Conference on Machine Learning, 2010.

[23] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Association for the Advancement of Artificial Intelligence, 2008.