{"title": "Transfer of Deep Reactive Policies for MDP Planning", "book": "Advances in Neural Information Processing Systems", "page_first": 10965, "page_last": 10975, "abstract": "Domain-independent probabilistic planners input an MDP description in a factored representation language such as PPDDL or RDDL, and exploit the specifics of the representation for faster planning. Traditional algorithms operate on each problem instance independently, and good methods for transferring experience from policies of other instances of a domain to a new instance do not exist. Recently, researchers have begun exploring the use of deep reactive policies, trained via deep reinforcement learning (RL), for MDP planning domains. One advantage of deep reactive policies is that they are more amenable to transfer learning. \n\nIn this paper, we present the first domain-independent transfer algorithm for MDP planning domains expressed in an RDDL representation. Our architecture exploits the symbolic state configuration and transition function of the domain (available via RDDL) to learn a shared embedding space for states and state-action pairs for all problem instances of a domain. We then learn an RL agent in the embedding space, making a near zero-shot transfer possible, i.e., without much training on the new instance, and without using the domain simulator at all. Experiments on three different benchmark domains underscore the value of our transfer algorithm. 
Compared against planning from scratch, and a state-of-the-art RL transfer algorithm, our transfer solution has significantly superior learning curves.", "full_text": "Transfer of Deep Reactive Policies for MDP Planning\n\nAniket Bajpai, Sankalp Garg, Mausam\n\nIndian Institute of Technology, Delhi\n\n{quantum.computing96, sankalp2621998}@gmail.com, mausam@cse.iitd.ac.in\n\nNew Delhi, India\n\nAbstract\n\nDomain-independent probabilistic planners input an MDP description in a factored\nrepresentation language such as PPDDL or RDDL, and exploit the speci\ufb01cs of\nthe representation for faster planning. Traditional algorithms operate on each\nproblem instance independently, and good methods for transferring experience\nfrom policies of other instances of a domain to a new instance do not exist. Recently,\nresearchers have begun exploring the use of deep reactive policies, trained via deep\nreinforcement learning (RL), for MDP planning domains. One advantage of deep\nreactive policies is that they are more amenable to transfer learning.\nIn this paper, we present the \ufb01rst domain-independent transfer algorithm for MDP\nplanning domains expressed in an RDDL representation. Our architecture exploits\nthe symbolic state con\ufb01guration and transition function of the domain (available\nvia RDDL) to learn a shared embedding space for states and state-action pairs for\nall problem instances of a domain. We then learn an RL agent in the embedding\nspace, making a near zero-shot transfer possible, i.e., without much training on the\nnew instance, and without using the domain simulator at all. Experiments on three\ndifferent benchmark domains underscore the value of our transfer algorithm. 
Compared against planning from scratch, and a state-of-the-art RL transfer algorithm, our transfer solution has significantly superior learning curves.

1 Introduction

The field of domain-independent planning designs planners that can be run for all symbolic planning problems described in a given input representation. The planners use representation-specific algorithms, thus allowing themselves to be run on all domains that can be expressed in the representation. Two popular representation languages for expressing probabilistic planning problems are PPDDL, Probabilistic Planning Domain Description Language [Younes et al., 2005], and RDDL, Relational Dynamic Influence Diagram Language [Sanner, 2010]. These languages express Markov Decision Processes (MDPs) with potentially large state spaces using a factored representation. Most traditional algorithms for PPDDL and RDDL planning solve each problem instance independently and are not able to share policies between two or more problems in the same domain [Mausam and Kolobov, 2012].
Very recently, probabilistic planning researchers have begun exploring ideas from deep reinforcement learning, which approximates state-value or state-action mappings via deep neural networks. Deep RL algorithms only expect a domain simulator and do not expect any domain model. Since the RDDL or PPDDL models can always be converted to a simulator, every model-free RL algorithm is applicable to the MDP planning setting. Recent results have shown competitive performance of deep reactive policies generated by RL on several planning benchmarks [Fern et al., 2018, Toyer et al., 2018].
Because neural models can learn latent representations, they can be effective at efficient transfer from one problem to another.
In this paper, we present the first domain-independent transfer algorithm for MDP planning domains expressed in RDDL. As a first step towards this general research area, we investigate transfer between equi-sized problems from the same domain, i.e., those with the same state variables, but different connectivity graphs.
We name our novel neural architecture TORPIDO – Transfer of Reactive Policies Independent of Domains. TORPIDO brings together several innovations that are well suited to symbolic planning problems. First, it exploits the given symbolic (factored) state and its connectivity structure to generate a state embedding using a graph convolutional network [Goyal and Ferrara, 2017]. An RL agent learnt over the state embeddings transfers well across problems. One of the most important challenges for our model is the actions, which are also expressed in a symbolic language. The same ground action name in two problem instances may actually mean different actions, since the state variables over which the action is applied may have different interpretations. As a second innovation, we train an action decoder that learns the mapping from an instance-independent state-action embedding to an instance-specific ground action. Third, we make use of the given transition function by training an instance-independent model of domain transitions in the embedding space. This gets transferred well and enables fast learning of the action decoder for every new instance. Finally, as a fourth innovation, we also implement an adversarially optimized instance classifier, whose job is to predict which instance a given embedding is coming from.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
It helps TORPIDO to learn instance-independent embeddings more effectively.
We perform experiments across three standard RDDL domains from IPPC, the International Probabilistic Planning Competition [Grzes et al., 2014]. We compare the learning curves of TORPIDO with A3C, a state-of-the-art deep RL engine [Mnih et al., 2016], and Attend-Adapt-Transfer (A2T), a state-of-the-art deep RL transfer algorithm [Rajendran et al., 2017]. We find that TORPIDO has much superior learning performance, i.e., it obtains a much higher reward for the same number of learning steps. Its strength is in near-zero-shot learning – it can quickly train an action decoder for a new instance and obtains a much higher reward than the baselines without running any RL or making any simulator calls. To summarize,

1. We present the first domain-independent transfer algorithm for symbolic MDP domains expressed in the RDDL language.

2. Our novel architecture TORPIDO uses the graph structure and transition function present in RDDL to induce an instance-independent RL agent, state encoder and other components. These can be easily transferred to a test instance, and only an action decoder is learned from scratch.

3. TORPIDO has significantly superior learning curves compared to existing transfer algorithms and training from scratch. Its particular strength is its near-zero-shot transfer – transfer of policies without running any RL.

We release the code of TORPIDO for future research.1

2 Background and Related Work

2.1 Reinforcement Learning

In its standard setting, an RL agent acts for long periods of time in an uncertain environment and wishes to maximize its long-term return. Its dynamics are modeled via a Markov Decision Process (MDP), which takes as input a state space S, an action space A, unknown transition dynamics Pr, and an unknown reward function R [Puterman, 1994].
The agent in state st at time t takes an action at to get a reward rt and make a transition to st+1 via its MDP dynamics. The h-step return Rt:t+h is defined as the discounted sum of rewards, Σ_{i=1}^{h} γ^{i−1} r_{t+i}. The value function Vπ(s) is the expected (infinite step) discounted return from state s if all actions are selected according to policy π(a|s). The action value function Qπ(s, a) is the expected discounted return after taking action a in state s and then selecting actions according to π(a|s) thereafter.
Deep RL algorithms approximate the policy [Williams, 1992], the value function [Mnih et al., 2015], or both [Mnih et al., 2016] via a neural network. Our work builds upon the Asynchronous Advantage Actor-Critic (A3C) algorithm [Mnih et al., 2016], which constructs approximations for both the policy (using the 'actor' network) and the value function (using the 'critic' network). The parameters of the actor network are adjusted to maximize the expected reward by using the gradient of the 'advantage' function, which measures the improvement of an action over the expected state value, A(s, a) = Q(s, a) − V(s). Hence the update to the actor network is the expectation of (∂/∂θ) log π(a|s) (Qπ(s, a; θ) − V(s; θ)). The critic network estimates the H-step lookahead return by minimizing the expectation of the mean squared loss, (R_{t:t+H} + γ^H V(s_{t+H+1}; θ−) − V(s_t; θ))². Here, the optimization is with respect to θ, the cumulative parameters in both actor and critic networks, and θ− are their previous values.

1 Available at https://github.com/dair-iitd/torpido
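As an illustration of these two updates, the following is a minimal NumPy sketch of the A3C actor and critic losses over a batch of timesteps; the function and variable names are ours for illustration, not from any particular implementation.

```python
import numpy as np

def a3c_losses(policy_logits, values, actions, returns):
    """Illustrative A3C losses.

    policy_logits: (T, |A|) actor outputs; values: (T,) critic outputs V(s_t);
    actions: (T,) actions taken; returns: (T,) bootstrapped targets
    R_{t:t+H} + gamma^H * V(s_{t+H+1}).
    """
    # Numerically stable log-softmax over the action dimension.
    z = policy_logits - policy_logits.max(axis=1, keepdims=True)
    log_pi = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    chosen = log_pi[np.arange(len(actions)), actions]
    # Advantage A(s, a) = Q(s, a) - V(s), estimated as return - value.
    advantage = returns - values
    # Actor: policy-gradient term, -E[log pi(a|s) * A(s, a)].
    actor_loss = -(chosen * advantage).mean()
    # Critic: mean squared error against the bootstrapped return.
    critic_loss = (advantage ** 2).mean()
    return actor_loss, critic_loss
```

In a full implementation the advantage is treated as a constant in the actor term, so that the policy gradient does not flow through the critic.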
Furthermore, many instances of the agent interact in parallel with many instances of the environment, which both accelerates and stabilizes learning in A3C.
Transfer Learning in Deep RL: Neural models are highly amenable to transfer learning, because they can learn generalized representations. Initial approaches to transfer learning in deep RL involved transferring the value function or policies from the source to the target task. More recent methods have developed these ideas further, for example, by using expert policies from multiple tasks and combining them with source task features to learn an actor-mimic policy [Parisotto et al., 2015], or by using a teacher network to propose a curriculum over tasks for effective multi-task learning [Matiisen et al., 2017]. A preliminary approach has also used a symbolic front-end for deep RL [Garnelo et al., 2016]. We compare our transfer algorithm against a recent algorithm that uses an attention mechanism to allow selective transfer and avoid negative transfer [Rajendran et al., 2017]. Recently, there have also been attempts at performing zero-shot transfer, i.e., transfer without seeing the new domain. An example is DARLA [Higgins et al., 2017], which leverages recent work on domain adaptation to learn a domain-independent representation of the state, and learns a policy over this state representation, hence making the learned policy robust to domain shifts. Our work attempts near-zero-shot transfer by learning a good policy for the target with limited training, and without any RL.

2.2 Probabilistic Planning

Planning problems are special cases of RL problems in which the transition function and reward function are known [Mausam and Kolobov, 2012].
In this work, we consider planning problems that model finite-horizon discounted-reward MDPs with a known initial state [Kolobov et al., 2012]. Thus, our problems take as input ⟨S, A, Pr, R, H, s0, γ⟩, where H is the horizon for the problem. Probabilistic planners can use the model to perform a full Bellman backup, i.e., an expectation over all next outcomes from an action (e.g., in Value Iteration [Bellman, 1957]), whereas RL agents can only backup from a single sampled next state (e.g., in Q-Learning [Sutton and Barto, 1998]).
Factored MDPs provide a more compact way to represent MDP problems. They decompose a state s into a set of n binary state variables (x1, x2, . . . , xn); the transition function specifies the change in each state variable, and the reward function also uses a factored representation. Solving a finite-horizon factored MDP is EXPTIME-complete, because a factored MDP of polynomial size can represent an MDP with exponentially many states [Littman, 1997].
RDDL Representation: RDDL describes a factored MDP via objects, predicates and functions. It is a first-order representation, i.e., it can be instantiated with a different set of objects to construct MDPs from the same domain. A domain has parameterized non-fluents to represent the part of the state space that does not change. A planning state comprises those state variables that can change via actions or natural dynamics; they are represented as parameterized fluents. The transition function for the system is specified via (stochastic) functions over next state variables conditioned on current state variables and actions. The reward function is also defined in factored form using the state variables.
We illustrate the RDDL language via the SysAdmin domain [Guestrin et al., 2001], which consists of a set of K computers connected in a network.
Each computer in the network can be shut down with a probability dependent on the ratio of its 'on' neighbours to its total number of neighbours. Any 'off' computer can randomly switch on with a reboot probability. The agent can take the action of rebooting a single computer, or no action at all, in each time step. Note that no-op is also a valid action, as this domain would evolve even if the agent does not take any action. The reward at each timestep is the number of 'on' computers at that timestep. The natural structure of the problem, and the factored nature of the transition and reward function, make this problem perfectly suited to be represented in RDDL as follows. Objects: c1, . . . , cK, the K computers; Non-fluents: connected(cj, ci), whose value is 1 if cj is a neighbor of ci; State fluents: on(ci), which is 1 if the ith computer is on; Action fluents: reboot(ci), which denotes that the agent rebooted the ith computer; Reward function: Σi [on(ci)]; Transition function: If reboot(ci), then on′(ci) = 1; Else if on(ci), then on′(ci) = Bernoulli(a + b × (1 + Σj 1[connected(cj, ci) ∧ on(cj)]) / (1 + Σj 1[connected(cj, ci)])); Else on′(ci) = Bernoulli(d). Here all primed fluents denote the value at the next time step, and a, b, and d are constants modeling the dynamics of the domain.
Deep Learning for Planning: Value Iteration Networks formalize the idea of running the Value Iteration algorithm within the neural model [Tamar et al., 2017]; however, they operate on the enumerated state space, instead of the factored space. There have been three recent works on the use of neural architectures for domain-independent factored MDP planning. One work learns deep reactive policies by using a network that mimics the local dependency structure in the RDDL representation of the problem [Fern et al., 2018].
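For concreteness, the SysAdmin transition function above can be sampled directly. The sketch below simulates one step of the dynamics; the constants a, b, d and the function names are illustrative (each IPPC instance fixes its own values).

```python
import random

def sysadmin_step(on, connected, reboot_idx=None, a=0.05, b=0.85, d=0.05, rng=random):
    """Sample the next SysAdmin state.

    on[i]: 1 if computer c_i is on; connected[j][i]: 1 if c_j is a neighbour of c_i;
    reboot_idx: index of the rebooted computer, or None for the no-op action.
    """
    K = len(on)
    nxt = [0] * K
    for i in range(K):
        if reboot_idx == i:
            nxt[i] = 1  # reboot(c_i) deterministically turns c_i on
        elif on[i]:
            nbrs = sum(connected[j][i] for j in range(K))
            up = sum(connected[j][i] * on[j] for j in range(K))
            p = a + b * (1 + up) / (1 + nbrs)  # stays on with probability p
            nxt[i] = 1 if rng.random() < p else 0
        else:
            nxt[i] = 1 if rng.random() < d else 0  # spontaneous reboot
    return nxt

def reward(on):
    return sum(on)  # number of 'on' computers this timestep
```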
We are also interested in problems specified in RDDL, but are more focused on transfer across problem instances. There has also been an early attempt at transfer in planning problems, for two specific classical (deterministic) domains, TSP and Sokoban [Groshev et al., 2018]. In contrast, we propose a transfer mechanism that can be used for domain-independent RDDL planning. Finally, Action-Schema Nets use layers of propositions and actions for solving and transferring between goal-oriented PPDDL planning problems [Toyer et al., 2018]. RDDL allows concurrent conditional effects, which when converted to PPDDL can lead to an exponential blowup in the action space. Therefore, ASNets are not scalable to the RDDL domains considered in this paper.

2.3 Graph Convolutional Networks

Graph Convolutional Networks (GCNs) generalize convolutional networks to arbitrarily structured graphs [Goyal and Ferrara, 2017]. A GCN layer takes as input a feature representation for every node in the graph (an M × DI feature matrix, where M is the number of nodes in the graph, and DI is the input feature dimension), and a representation for the graph structure (an M × M adjacency matrix A), and produces an output feature representation for every node (in the form of an M × DO matrix, where DO is the output feature dimension). A layer of GCN can be written as F(l+1) = g(F(l), A), where F(l) and F(l+1) are the feature representations for the lth and (l+1)th layers.
We use this propagation rule (from [Kipf and Welling, 2017]) in our work: g(F(l), A) = σ(D̂^(−1/2) Â D̂^(−1/2) F(l) W(l)), where Â = A + I, I being the identity matrix, and D̂ is the diagonal node degree matrix of Â. Intuitively, this propagation rule implies that the feature at a particular node of the (l+1)th layer is a weighted sum of the features of the node and all its neighbours at the lth layer.
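This propagation rule is only a few lines of linear algebra; the following NumPy sketch (with assumed names) computes one such layer.

```python
import numpy as np

def gcn_layer(F_l, A, W, act=np.tanh):
    """One GCN layer: act(D^(-1/2) (A + I) D^(-1/2) F W), per Kipf and Welling.

    F_l: (M, D_I) node features; A: (M, M) adjacency matrix; W: (D_I, D_O) weights.
    Returns the (M, D_O) feature matrix of the next layer.
    """
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^(-1/2)
    return act(D_inv_sqrt @ A_hat @ D_inv_sqrt @ F_l @ W)
```

Each application mixes a node's features with those of its immediate neighbours, so stacking layers widens the receptive field one hop at a time.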
Furthermore, these weights are shared at all nodes of the graph, similar to how the weights of a CNN kernel are shared at all locations of the image. Hence, at each layer, the GCN expands its receptive field at each node by 1. A deep GCN network, i.e., a network consisting of stacked GCN layers, can therefore have a large enough receptive field to construct good feature representations for each node of a graph.

3 Problem Formulation

The transfer learning problem is formulated as follows. We wish to learn the policy of the target problem instance PT, where in addition to PT, we are given N source problem instances P1, P2, ..., PN. For this paper, we make the assumption that the state spaces, action spaces and rewards of all problems are the same, even though their initial states and non-fluents could be different. For example, computers in SysAdmin may be arranged in different topologies (based on different values of the non-fluents connected). Any transfer learning solution will operate in two phases: (1) Learning phase: learn policies π1, π2, ..., πN over each source problem instance, and possibly also learn general representations that will help in transfer. (2) Transfer phase: learn the policy πT for PT using all output of the learning phase.
An ideal zero-shot transfer approach will be one where the target instance environment is not even used during the transfer phase. Two indicators of good transfer are a high pre-train (zero-shot transfer) score, and more sample-efficient learning compared to a policy learnt from scratch on PT.

4 Transfer Learning Framework

Our approach hinges on the claim that there exists a 'good' embedding space for all states, as well as a 'good' embedding space of all state-action pairs, which is shared by all equi-sized instances of a given domain.
A 'good' state embedding space is a space in which similar states are close together and dissimilar states are far apart (similarly for state-action pair embeddings).
Our neural architecture is shown in Figure 1. Broadly, our architecture has five components: a state encoder (SE), a state-action decoder (SAD), an RL module (RL), a transition module (Tr), and an instance classifier (IC). In the training phase, TORPIDO learns instance-independent modules for SE, RL, Tr and IC, but an instance-specific SADI (for PI). Its transfer phase operates in two steps. First, using the general SE and Tr models, it learns weights for SADT, the target action decoder. We call this near-zero-shot learning, because this can compute a policy πT for PT without running any RL. Once this SADT is effectively trained, we transfer all other components and retrain them via RL to improve the policy further for the target instance.
State Encoder: We leverage the structure of RDDL domains to represent the instance in the form of a graph (state variables as nodes, with edges between nodes if the respective objects are connected via the non-fluents in the domain). The state encoder takes the adjacency matrix for the instance graph and the current state as input, and transforms the state to its embedding. We use a Graph Convolutional Network (GCN) to perform this encoding. The GCN constructs multiple features for each node at each layer. Hence, the output of the deep GCN consists of multi-dimensional features at each node, which represent embeddings for the corresponding state variables. These embeddings are concatenated to produce the final state embedding, which is the output of the state encoder module.
RL Module: This deep RL agent takes in a state embedding as input and outputs a policy in the form of a state-action embedding.
This embedding is an abstract representation of a soft action, i.e., a distribution over actions in the embedding space (which will be further decoded by the action decoder). We think of this embedding as representing the pair of state and soft action, instead of just a soft action. This is because the same action may have different effects depending on the state, and hence a standalone action embedding would not make sense. This can be seen as a neural representation of the notion of state-action pair symmetries in RL domains [Anand et al., 2015, 2016]. We use the A3C algorithm to learn our RL agent, because it has been shown to be robust and stable [Mnih et al., 2016], though other RL variants can easily replace this module. A3C uses a simulator, which can be easily created by sampling from the known transition function in the RDDL representation.
We note that only the policy network of the A3C agent is shared between instances and operates in the embedding space. The value network is different for each instance and operates in the original state space. We did not try to learn a transferable value function, as we were ultimately only concerned with the policy, and not the value, in the target domain. Hence, in our case, the sole purpose of the value function is to assist the policy function in learning a good policy.
Action decoder: The action decoder aims to learn a transformation from the state-action embedding to a soft action (a probability distribution over actions). However, such a transformation would not be well-defined, as a state-action embedding could correspond to more than one (symmetric) state-action pair, and hence more than one corresponding action. E.g., consider a navigation problem over a square grid [Ravindran, 2004], with the goal at the top-right corner.
Its immediate neighbors (the state to the left, say s1, and the state below, say s2) will be symmetric, as they both can reach the goal in one step. We expect them to have the same state-action pair embedding with their respective optimal actions. However, the optimal actions are "right" for s1 and "up" for s2.
To resolve this ambiguity, we need to input the state as well into the action decoder. The decoder outputs a probability distribution over actions, π(s). TORPIDO implements the action decoder using a fully connected network with a softmax to output a distribution over actions. It is important to realize that we need a separate action decoder for each instance, as the required transformation is different for different instances. For example, if the navigation problem is changed so that the goal is now in the lower left corner, all other embeddings may transfer, but the final action output will be different ("down" and "left" for states symmetric to s1 and s2). The action decoder is the only component which cannot be directly transferred from source problems to the target problem.
Transition Transfer Module: To speed up the transfer to the target domain, we additionally learn a transition module in the learning phase. This module takes as input the current and next state embeddings (s, s′), and outputs a soft-action embedding (interpreted as a distribution over actions), with the semantics that the output distribution is p(a) = Pr(s, a, s′) / Σ_{a′} Pr(s, a′, s′). I.e., the output embedding maintains which action is more likely to be responsible for the transition from s to s′. The gold data for training this module can be easily computed via the RDDL representation. Note that the transition and RL module share both the state and state-action embedding spaces. This novel module allows us to quickly learn an action decoder, thus allowing a near-zero-shot transfer to take place.

Figure 1: Model architecture for TORPIDO. The architecture is divided into three phases: training phase, transfer phase (learning SADT), and transfer phase (full transfer). The figure shows training of the ith problem – this is replicated N times with shared SE, Tr, RL and IC modules. Boxes of the same color have the same weights during each step. Across different steps, boxes of the same color signify that they have been initialized from the previous step. A red outline signifies that those weights are being trained in that step.

Instance classifier: TORPIDO's aim is to learn state-action embeddings independent of the instance (since they are shared between all instances). To explicitly enforce this condition, we use an idea from domain adaptation [Ganin et al., 2017]. Essentially, as an auxiliary task, we try to learn a classifier to predict the problem instance, given a state-action embedding. This is done in an adversarial manner, so that the model learns to produce such state-action embeddings that even the best possible classifier is unable to predict the instance from which they were generated. In the ideal case, at equilibrium, the model learns to produce state-action embeddings which are instance invariant, as even the best instance classifier would predict an equal probability over all source instances.

4.1 TORPIDO's Training & Transfer Phases

Learning Phase: During the learning phase, we learn a policy over each of the N problem instances in a multi-task manner by sharing the state encoder and RL module, and by using separate decoders for each instance. We also learn the transition function to predict the distribution of actions given consecutive states. The instance invariance is implemented using a gradient reversal layer, as in [Ganin et al., 2017].
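The gradient reversal layer itself is tiny: an identity map in the forward pass whose backward pass flips (and optionally scales) the gradient. A framework-agnostic sketch with our own names follows; in practice this would be a custom autograd op.

```python
import numpy as np

class GradientReversal:
    """Identity forward; backward multiplies incoming gradients by -lam.

    Placed between the state-action embedding and the instance classifier,
    it makes the encoder maximize the classifier's loss while the
    classifier itself minimizes it.
    """
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # embeddings pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient to the encoder
```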
I.e., the gradients for the instance classification loss are back-propagated in the standard manner through the instance classifier layer, but with their sign reversed in all the layers preceding the state-action embedding (hence enforcing the adversarial objective function of the game described). The loss function for training is a weighted sum of the policy gradient loss of A3C, a cross-entropy loss for prediction of the actions given consecutive states (from the transition module), and the instance misclassification loss, i.e., the cross-entropy loss from the instance module with sign reversed. The instance classification module is trained to minimize the cross-entropy loss between the predicted instance distribution and the ground-truth instance. Mathematically,

E(θE, θD1, ..., θDN, θIC, θtr) = Σ_{i=1}^{N} Lp(θE, θDi) − λ Σ_{i=1}^{N} Lc(θE, θIC) + λtr Σ_{i=1}^{N} Ltr(θE, θDi, θtr)    (1)

where N is the number of training instances, Lp is the loss function of the policy network of the A3C agent, Lc is the cross-entropy loss for the instance classifier, and Ltr is the cross-entropy loss for the transition module. Here θE represents the combined parameters of the encoder and RL module, θDi, i = 1...N, represents the parameters of the decoder module of the ith agent, θIC represents the parameters of the instance classifier module, and θtr represents the parameters of the transition module. We are seeking parameters θ*E, θ*D1, ..., θ*DN, and θ*IC that deliver a saddle point of the functional, such that (θ*E, θ*D1, ..., θ*DN) = argmin_{θE, θD1, ..., θDN} E(θE, θD1, ..., θDN, θIC) and θ*IC = argmax_{θIC} E(θE, θD1, ..., θDN, θIC).

Table 1: Comparison of TORPIDO (TP) against other baselines, stopped at 4 different iteration numbers.

Train iter        0                    0.1M                 0.5M                 ∞
Algo         A3C   A2T   TP       A3C   A2T   TP       A3C   A2T   TP       A3C   A2T   TP
Sys#1        0.00  0.08  0.23     0.01  0.09  0.26     0.11  0.19  0.39     0.31  0.38  1.0
Sys#5        0.00  0.02  0.26     0.03  0.11  0.30     0.08  0.17  0.64     0.30  0.40  1.0
Sys#10       0.02  0.06  0.26     0.04  0.09  0.33     0.08  0.13  0.49     0.32  0.33  1.0
Game#1       0.00  0.19  0.34     0.04  0.22  0.49     0.43  0.60  0.98     0.54  0.61  1.0
Game#5       0.00  0.03  0.41     0.11  0.11  0.55     0.24  0.17  0.77     0.40  0.44  1.0
Game#10      0.07  0.05  0.34     0.03  0.08  0.49     0.08  0.14  0.88     0.20  0.20  1.0
Navi#1       0.00  0.04  0.72     0.01  0.04  0.72     0.10  0.19  0.9      0.22  0.25  1.0
Navi#2       0.00  0.01  0.68     0.05  0.06  0.73     0.26  0.45  1.0      0.55  0.56  1.0
Navi#3       0.00  0.01  0.50     0.03  0.03  0.60     0.21  0.42  0.71     0.40  0.40  1.0

Figure 2: Learning curves on the 1st problem of the three domains: (a) SysAdmin, (b) Game of Life, (c) Navigation. In all cases TORPIDO outperforms other baselines by wide margins.

Transfer Phase: During this phase, the state encoder simply requires the adjacency matrix of the target instance as input, and it directly outputs state embeddings for this problem using SE. Since our RL agent operates in the embedding space, it is exactly the same for the target instance and, hence, is directly transferred. However, an action decoder needs to be relearnt. For this, we make use of the fact that the transition function also operates only in the embedding space, so it can also be directly transferred. For the target instance, we try to predict the distribution over actions given consecutive states in the new instance. The weights for the state encoder and transition module remain fixed, while the weights for the decoder for the new instance are learned. This decoder can then be directly used to transform the state-action embeddings predicted by the RL module into a distribution over actions, i.e., a policy for the new instance. Hence, we are able to achieve a near zero-shot transfer, i.e., without doing any RL in the new environment and by simply retraining the action decoder via transition transfer. After the weights of each module have been initialized as above, TORPIDO follows the same training procedure as in A3C. This generates the learning curve, post the zero-shot transfer.
In summary, this architecture leverages the extra information in RDDL domains in two ways: (i) it uses the input structure to represent the state as a graph, and uses a GCN to learn a state embedding; (ii) it uses the transition function (available in the RDDL file) to learn the decoder for the new domain.

5 Experiments

We wish to answer three experimental questions. (1) Does TORPIDO help in transfer to new problem instances?
(2) How does TORPIDO compare against other state-of-the-art transfer learning frameworks? (3) How important is each component of TORPIDO?

Domains: We make all comparisons on three different RDDL domains used in IPPC, the International Probabilistic Planning Competition 2014 [Grzes et al., 2014] – SysAdmin, Game of Life and Navigation. We have already described the SysAdmin domain [Guestrin et al., 2001] in the background section. The Game of Life represents a grid-world cellular automaton, where each cell is dead or alive. Each alive cell continues to live in the next step as long as there is no over- or under-population (measured by the number of adjacent live cells). Additionally, the agent can make any one cell alive in each step. Finally, the Navigation domain requires a robot to move in a grid from one location to another using four actions – up, down, left and right. There is a river in the middle, and the robot can drown with a non-zero probability in each location. If it drowns, the current episode is over and it restarts from the start state in the next episode.

Table 2: Incremental value of each component of TORPIDO at the start of training (zero-shot).

           A3C+GCN   A3C+GCN+SAD   A3C+GCN+SAD+IC
Sys#1      0.01      0.25          0.23
Sys#5      0.01      0.23          0.26
Sys#10     0.01      0.22          0.26
Game#1     0.03      0.31          0.34
Game#5     0.02      0.26          0.41
Game#10    0.01      0.24          0.34
Navi#1     0.05      0.70          0.72
Navi#2     0.03      0.59          0.68
Navi#3     0.01      0.27          0.50

Figure 3: Incremental value of various components of TORPIDO. Results on the 2nd problem of the three domains: (a) SysAdmin, (b) Game of Life, (c) Navigation. In all cases the state-action decoder is critical for good performance. The instance classifier has marginal benefits.

All three of these domains have some spatial structure, but it is implicit in the symbolic description (exposed via non-fluents in RDDL). This makes them ideal choices for a first study of its kind.
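To make the Game of Life dynamics concrete, the following is a minimal deterministic sketch of the Conway-style update that the domain is built on; the actual IPPC/RDDL version is stochastic and rewards living cells, and the `neighbors`/`step` helpers are illustrative names of ours, not part of any benchmark code.

```python
def neighbors(cell):
    """The eight cells adjacent to a (row, col) coordinate."""
    r, c = cell
    return {(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)}

def step(live, action=None):
    """One transition: a cell survives with 2-3 live neighbours (no over-
    or under-population) and is born with exactly 3; the optional agent
    action then makes one chosen cell alive."""
    candidates = live | {n for c in live for n in neighbors(c)}
    nxt = set()
    for cell in candidates:
        k = len(neighbors(cell) & live)
        if (cell in live and k in (2, 3)) or (cell not in live and k == 3):
            nxt.add(cell)
    if action is not None:
        nxt.add(action)  # agent makes any one cell alive per step
    return nxt
```

For example, a vertical "blinker" {(0, 1), (1, 1), (2, 1)} flips to the horizontal {(1, 0), (1, 1), (1, 2)} and back, illustrating the purely passive dynamics on which the agent's single-cell action is layered.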
For each domain, we perform experiments on three different instances. A higher-numbered problem roughly corresponds to a problem with a much larger state space. For example, SysAdmin1 has 10 computers and SysAdmin10 has 50 computers (effective state space sizes 2^10 and 2^50 respectively).

Experimental Settings and Comparison Algorithms: In the spirit of domain-independent planning, all hyperparameters are kept constant for all problems in all domains. Our parameters are as follows. A3C's value network as well as its policy network use two GCN layers (3 and 7 feature maps) and two fully connected layers. The action decoder implements two fully connected layers. All layers use exponential linear unit (ELU) activations [Clevert et al., 2015]. All networks are trained using RMSProp with a learning rate of 5e−5. For TORPIDO, we set N = 4, i.e., the training phase uses four source problems. Random problems are generated for training using the generators available for each domain. All weights of the GCN of the policy network, and of the RL module of the policy network, are shared.

We implement two baseline algorithms for comparison. First, we implement a base non-transfer algorithm, A3C. This is chosen since TORPIDO uses A3C as its RL agent; thus, this comparison directly shows the value of transfer. We also implement a state-of-the-art RL transfer solution called A2T – Attend, Adapt and Transfer [Rajendran et al., 2017], which retrains while using attention over the learned source policies for selective transfer. It uses the same four source problems as TORPIDO. This comparison exposes the specific value of our transfer mechanism, which uses the RDDL representation, compared to a representation-agnostic transfer mechanism.

Evaluation Metrics: First, we output learning curves.
For that, we stop training after a set number of training iterations (say i) and estimate the return of the current policy, Vπ(i), by simulating it till the horizon specified in the RDDL file. The reported values are an average of 100 such simulations. Moreover, we also report the metric α(i) = (Vπ(i) − Vinf) / (Vsup − Vinf), where Vinf and Vsup respectively denote the lowest and the highest values obtained on this instance (by any algorithm at any time during training). Since α is a ratio, it acts as an indicator of the training 'stage' of the model, and hence helps to understand the transfer process as it progresses in time, irrespective of the starting (random) reward and the final reward of the model. Moreover, α(0) acts as a measure of (near) zero-shot transfer.

5.1 TORPIDO's Transfer Ability

We first measure the ability of the model to transfer knowledge across problem instances, comparing against all baselines. Figure 2 compares the learning curves of TORPIDO with the two baselines on one problem from each of the three domains (error bars are 95% confidence intervals over ten runs). The results on the other problems are quite similar. The x-axis is RL training time on the target instance, which for TORPIDO also includes the time for training the action decoder. First, we find that A3C is not competitive with even A2T in its learning. This suggests that transfer is quite valuable for these problems. We also find that A2T itself performs substantially worse than TORPIDO. We attribute this to the various components of TORPIDO that exploit the domain knowledge expressed in the RDDL representation.

In Table 1 we show comparisons between these algorithms at four different training points. We report the α values, as described above.
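As a concrete illustration of the α normalization defined above, the computation is simply a rescaling of the estimated return into [0, 1] (a minimal sketch; the function name and example values are ours, not from the benchmark):

```python
def alpha(v_pi, v_inf, v_sup):
    """Normalized training 'stage': 0 at the lowest return observed on
    this instance, 1 at the highest (any algorithm, any training time)."""
    return (v_pi - v_inf) / (v_sup - v_inf)

# A policy earning -40 on an instance whose observed returns span
# [-100, -20] is at alpha = (-40 + 100) / (-20 + 100) = 0.75.
```

Because the denominator is fixed per instance, α values are comparable across algorithms on the same instance even when raw rewards are on very different scales.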
We \ufb01nd that TORPIDO is vastly ahead of all algorithms at\nall times, underscoring the immense value our architecture offers.\n\n5.2 Ablation Study\n\nIn order to understand the incremental contribution of each of our components we compare three\ndifferent versions of TORPIDO. The \ufb01rst version is A3C+GCN. This version only performs state\nencoding but does not perform any action decoding. Our next version is A3C+GCN+SAD, which\nincorporates the action decoding (and also transition transfer module to aid action decoding). Finally,\nour full system adds an IC to previous name \u2013 it includes the instance classi\ufb01cation component.\nFigure 3 shows the learning curves for the three problems. We observe that use of GCN helps the\nalgorithm converge to a high \ufb01nal score. Comparing this to vanilla A3C and A2T in Figure 2, we\nlearn that the use of GCN is critical in exposing the structure of the domain to the RL agent, helping\nit in learning a \ufb01nal good policy. However, the zero-shot nature of the transfer is very weak, because\nthe action names may be different in the source and target. Use of action-decoder and transition\ntransfer speeds up the near zero-shot transfer immensely. This can be observed from Table 2, which\ncompares these algorithms before starting the RL training. We see a huge jump in \u03b1 for the model\nwith action decoder compared to the one without it. Finally, Table 2 suggests that the improvement of\ninstance classi\ufb01cation in the beginning is signi\ufb01cant. 
However, very soon the incremental benefit is reduced; the final TORPIDO performs only marginally better than the A3C+GCN+SAD version.

6 Conclusions

We present the first domain-independent transfer algorithm for transferring deep RL policies from source probabilistic planning problems (expressed in the RDDL language) to a target problem from the same domain.² Our algorithm TORPIDO combines a base RL agent (A3C) with several novel components that use the RDDL model: a state encoder, an action decoder, a transition transfer module and an instance classifier. Only the action decoder needs to be re-learnt for a new problem; all the other components can be directly transferred. This allows TORPIDO to perform an effective transfer even before the RL starts, by quickly retraining the action decoder using the given RDDL model (near zero-shot learning). Experiments show that TORPIDO's learning curves are vastly superior to those of training from scratch as well as of a state-of-the-art RL transfer method. In the future, we wish to extend this work to transfer across problem sizes, and later to transfer across domains.

Acknowledgements

We thank Ankit Anand and the anonymous reviewers for their insightful comments on an earlier draft of the paper. We also thank Alan Fern, Scott Sanner, Akshay Gupta and Arindam Bhattacharya for initial discussions on the research. This work is supported by research grants from Google, a Bloomberg award, an IBM SUR award, a 1MG award, and a Visvesvaraya faculty award by the Govt. of India. We thank Microsoft Azure sponsorships, and the IIT Delhi HPC facility for computational resources.

² Available at https://github.com/dair-iitd/torpido

References

Ankit Anand, Aditya Grover, Mausam, and Parag Singla.
ASAP-UCT: Abstraction of state-action pairs in UCT. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 1509–1515, 2015.

Ankit Anand, Ritesh Noothigattu, Mausam, and Parag Singla. OGA-UCT: On-the-go abstractions in UCT. In Proceedings of the Twenty-Sixth International Conference on Automated Planning and Scheduling, ICAPS 2016, London, UK, June 12-17, 2016, pages 29–37, 2016.

Richard Bellman. A Markovian Decision Process. Indiana University Mathematics Journal, 1957.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, abs/1511.07289, 2015. URL http://arxiv.org/abs/1511.07289.

Alan Fern, Murugeswari Issakkimuthu, and Prasad Tadepalli. Training deep reactive policies for probabilistic planning problems. In ICAPS, 2018.

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor S. Lempitsky. Domain-adversarial training of neural networks. In Domain Adaptation in Computer Vision Applications, pages 189–209, 2017.

Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. CoRR, abs/1609.05518, 2016. URL http://arxiv.org/abs/1609.05518.

Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. CoRR, abs/1705.02801, 2017.

Edward Groshev, Aviv Tamar, Maxwell Goldstein, Siddharth Srivastava, and Pieter Abbeel. Learning generalized reactive policies using deep neural networks. In ICAPS, 2018.

Marek Grzes, Jesse Hoey, and Scott Sanner. International Probabilistic Planning Competition (IPPC) 2014. In ICAPS, 2014. URL https://cs.uwaterloo.ca/~mgrzes/IPPC_2014/.

Carlos Guestrin, Daphne Koller, and Ronald Parr. Max-norm projections for factored MDPs.
In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, IJCAI 2001, Seattle, Washington, USA, August 4-10, 2001, pages 673–682, 2001.

Irina Higgins, Arka Pal, Andrei A. Rusu, Loïc Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1480–1490, 2017.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

Andrey Kolobov, Mausam, and Daniel S. Weld. A theory of goal-oriented MDPs with dead ends. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, August 14-18, 2012, pages 438–447, 2012.

Michael L. Littman. Probabilistic propositional planning: Representations and complexity. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Innovative Applications of Artificial Intelligence Conference, AAAI 97, IAAI 97, July 27-31, 1997, Providence, Rhode Island, pages 748–754, 1997.

Tambet Matiisen, Avital Oliver, Taco Cohen, and John Schulman. Teacher-student curriculum learning. CoRR, abs/1707.00183, 2017. URL http://arxiv.org/abs/1707.00183.

Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI Perspective. Morgan & Claypool Publishers, 2012.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning.
Nature, 518(7540):529–533, 2015.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1928–1937, 2016.

Emilio Parisotto, Lei Jimmy Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. CoRR, abs/1511.06342, 2015. URL http://arxiv.org/abs/1511.06342.

M. L. Puterman. Markov Decision Processes. John Wiley & Sons, Inc., 1994.

J. Rajendran, A. S. Lakshminarayanan, M. M. Khapra, P. Parthasarathy, and B. Ravindran. Attend, adapt, and transfer: Attentive deep architecture for adaptive transfer from multiple sources in the same domain. In ICLR, 2017.

Balaraman Ravindran. An Algebraic Approach to Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2004.

Scott Sanner. Relational Dynamic Influence Diagram Language (RDDL): Language Description. 2010.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998.

Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 4949–4953, 2017.

Sam Toyer, Felipe W. Trevizan, Sylvie Thiébaux, and Lexing Xie. Action schema networks: Generalised policies with deep learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, 2018.

Ronald J. Williams.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

Håkan L. S. Younes, Michael L. Littman, David Weissman, and John Asmuth. The first probabilistic track of the international planning competition. J. Artif. Intell. Res., 24:851–887, 2005.