{"title": "Zero-Shot Transfer with Deictic Object-Oriented Representation in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2291, "page_last": 2299, "abstract": "Object-oriented representations in reinforcement learning have shown promise in transfer learning, with previous research introducing a propositional object-oriented framework that has provably efficient learning bounds with respect to sample complexity. However, this framework has limitations in terms of the classes of tasks it can efficiently learn. In this paper we introduce a novel deictic object-oriented framework that has provably efficient learning bounds and can solve a broader range of tasks. Additionally, we show that this framework is capable of zero-shot transfer of transition dynamics across tasks and demonstrate this empirically for the Taxi and Sokoban domains.", "full_text": "Zero-Shot Transfer with Deictic Object-Oriented\n\nRepresentation in Reinforcement Learning\n\nO\ufb01r Marom1, Benjamin Rosman 1,2\n\n1University of the Witwatersrand, Johannesburg, South Africa\n\n2Council for Scienti\ufb01c and Industrial Research, Pretoria, South Africa\n\nAbstract\n\nObject-oriented representations in reinforcement learning have shown promise\nin transfer learning, with previous research introducing a propositional object-\noriented framework that has provably ef\ufb01cient learning bounds with respect to\nsample complexity. However, this framework has limitations in terms of the classes\nof tasks it can ef\ufb01ciently learn. In this paper we introduce a novel deictic object-\noriented framework that has provably ef\ufb01cient learning bounds and can solve a\nbroader range of tasks. 
Additionally, we show that this framework is capable\nof zero-shot transfer of transition dynamics across tasks and demonstrate this\nempirically for the Taxi and Sokoban domains.\n\n1\n\nIntroduction\n\nA longstanding objective in reinforcement learning (RL) is transfer learning, where the aim is to\naccelerate learning in an unseen task using knowledge gained in previously learned tasks [14]. Various\nMarkov decision process (MDP) representations have shown promise in this regard [5, 6, 7, 11, 9]. A\ncommon choice among such representations is one where components of an MDP are described with\nobjects [7, 4, 12, 8]. In particular, Propositional Object-Oriented MDPs (Propositional OO-MDPs)\n[4] introduce a framework to represent the state-space of an MDP in terms of objects while the\ntransition dynamics are represented in terms of propositional preconditions over object classes that\nmap to effects over attributes of the object classes. An appealing property of Propositional OO-MDPs\nis that the transition dynamics have provably ef\ufb01cient learning bounds for deterministic environments\nthat have been shown to outperform competing model-based approaches for certain domains [3].\nUnfortunately, the core restriction of Propositional OO-MDPs that the preconditions be described\nonly in terms of propositions tends to be a strong one and, as even the original authors point out,\nprecludes ef\ufb01cient learning of certain classes of tasks [3]. Such tasks include those where it is\nrequired to distinguish between different objects of the same object class. As a speci\ufb01c example of\nthis, and further elaborated in subsection 2.3, consider the Sokoban domain where a person attempts\nto push a box but cannot do so if that box is adjacent to a wall. In this case, there is no way of tying\nthe box that is adjacent to the person with the box that is adjacent to the wall with propositions. 
To\naccommodate such tasks, the transition dynamics for Propositional OO-MDPs must be appended\nwith \ufb01rst-order predicates 1. However, this impacts the learning ef\ufb01ciency as well as the ability to\ntransfer between tasks since adding more objects now increases the number of preconditions that the\ntransition dynamics depend on.\nIn this paper we propose to overcome this limitation of Propositional OO-MDPs with a novel deictic\nobject-oriented representation, Deictic OO-MDPs. The key insight behind Deictic OO-MDPs is\nthe concept of deictic predicates. Deictic predicates are grounded only with respect to a central\n\n1In this paper we refer to propositions as logic statements over object classes, while we refer to \ufb01rst-order\n\npredicates as logic statements that depend on at least one grounded object.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fdeictic object, therefore that object may relate itself to non-grounded object classes, but not to other\ngrounded objects. Returning to the Sokoban domain, a deictic predicate over boxes allows a speci\ufb01c\nbox to ascertain whether any wall is adjacent to it, but not whether a speci\ufb01c wall is adjacent to it. As\nwe formalise in section 2, Deictic OO-MDPs are de\ufb01ned in terms of a schema (or template) and so\nfor all tasks instantiated from that schema the number of preconditions that the transition dynamics\ndepend on remains constant. This makes transfer of the transition dynamics possible across all tasks\ninstantiated from that schema, which is illustrated empirically in section 4. Furthermore, Deictic\nOO-MDPs allow for the ef\ufb01cient learning of transition dynamics for classes of tasks not possible\nwith propositional frameworks, as discussed in section 3.\n\n2 Framework\n\n2.1 Background\n\nMost commonly an RL task is described as a discrete-time, \ufb01nite-state and \ufb01nite-action Markov\ndecision process (MDP) [13]. 
Given an MDP, M, a well-known model-based algorithm to learn a near-optimal policy for M is Rmax [1], which is known to have polynomial sample complexity. The KWIK (knows what it knows) [10] framework generalises Rmax to a broader range of representations, such that if the transition dynamics can be learned with polynomial sample complexity under the KWIK protocol, it is possible to construct an Rmax type algorithm to learn a near-optimal policy for M. The main requirement of the KWIK framework is that when an agent is required to make a prediction of the next state distribution given the agent's current state and action, the agent may choose to return ⊥ instead, meaning that the agent is unable to make an accurate prediction as it has yet to explore the environment sufficiently. Furthermore, the number of times the agent may return ⊥ must have a polynomial bound (which is called the KWIK bound).
The KWIK framework has been used in conjunction with object-oriented representations for efficient learning [4]. The main idea behind object-oriented representations is that the state-space is made up of grounded objects that are instantiations of object classes. Such representations for the state-space were introduced with Relational MDPs that define a domain in terms of a schema [7].
Formally, the state-space for such a schema consists of a set of object classes C = {C_i}^{N_C}_{i=1}. Each object class C ∈ C has a set of attributes Att(C) = {C.α_i}, and each attribute C.α ∈ Att(C) of an object class has a domain Dom(C.α). Given a schema, a grounded state-space is instantiated by first selecting a grounded object set which consists of n objects O = {o_i}^n_{i=1}, where each o ∈ O is an instance of some object class C. The value of attribute C.α for object o is denoted by o.α. Then the grounded state-space, denoted S_O, is an assignment of each o.α for all objects in O. 
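The KWIK protocol described above can be made concrete with the simplest KWIK learner, memorisation over deterministic dynamics (referred to again in section 3). The class below is our own illustrative sketch, not code from the paper; `None` stands in for the ⊥ symbol.

```python
class MemorisationKWIK:
    """Simplest KWIK learner for deterministic dynamics: predict only
    transitions that have already been observed, otherwise answer
    "don't know" (None stands in for the paper's bottom symbol)."""

    def __init__(self):
        self.memory = {}       # (state, action) -> observed next state
        self.bottom_count = 0  # how many times we answered "don't know"

    def predict(self, state, action):
        if (state, action) not in self.memory:
            self.bottom_count += 1  # the KWIK bound caps this counter
            return None
        return self.memory[(state, action)]

    def observe(self, state, action, next_state):
        self.memory[(state, action)] = next_state
```

For this learner the KWIK bound is simply the number of distinct (state, action) pairs, which is why plain memorisation over D boolean preconditions costs 2^D in section 3.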
The schema state-space, denoted S, is the set of all states for all possible object sets O.
To make the notion of a schema state-space concrete, consider the classical Taxi domain [2] where a taxi in a gridworld has the task of picking up a passenger at some pickup location and dropping them off at some destination location. The actions available to the taxi are North, East, South, West, Pickup and Dropoff, while walls limit the taxi's movements. We introduce a more general extension of this domain called the all-passenger any-destination Taxi domain where a taxi is tasked to pick up multiple passengers and drop each of them off at one of any destination locations. The taxi can only pick up one passenger at a time, so if a passenger is already in the taxi and the Pickup action is taken while the taxi is at the pickup location of another passenger, the state does not change.
We can represent this more general Taxi domain with four object classes: Taxi, Wall, Passenger and Destination. Each object class has attributes x and y for their location on the grid. Object class Wall has an additional attribute pos to mark one of four positions in a square, while Passenger has additional attributes in-taxi and at-destination to indicate if the passenger is in a taxi and at a destination respectively. Given the schema for this domain we can instantiate a set of grounded objects from the object classes and a resulting MDP. Figure 1 shows sample states of the schema.
Propositional OO-MDPs and Relational MDPs use similar state-space representations as described above. Where they differ is in the definition of their transition dynamics. While Relational MDPs use aggregation for preconditions [7], Propositional OO-MDPs use propositions for preconditions [4]. Furthermore, the work on Relational MDPs assumes that the transition dynamics are known and the focus is on learning generalised plans. 
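As a minimal sketch of how this Taxi schema might be encoded: the object classes and attribute names below follow the text, while the use of Python dataclasses and the concrete encodings (e.g. pos as a single character) are our own assumptions.

```python
from dataclasses import dataclass

# Object classes of the all-passenger any-destination Taxi schema.
# Attribute names follow the text; everything else is illustrative.
@dataclass
class Taxi:
    x: int
    y: int

@dataclass
class Wall:
    x: int
    y: int
    pos: str  # one of four positions in a square, e.g. 'N', 'E', 'S', 'W'

@dataclass
class Passenger:
    x: int
    y: int
    in_taxi: bool = False
    at_destination: bool = False

@dataclass
class Destination:
    x: int
    y: int

# A grounded state is an attribute assignment for a chosen object set O.
state = {
    'taxi': Taxi(0, 0),
    'passengers': [Passenger(2, 3), Passenger(4, 1)],
    'destinations': [Destination(4, 4)],
    'walls': [Wall(1, 0, 'E')],
}
```

A different object set O (more passengers, more destinations) instantiates a different grounded MDP from the same schema, which is what later enables transfer.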
Meanwhile, Propositional OO-MDPs assume that the transition dynamics are unknown and need to be learned through interaction with the environment.

Figure 1: P marks a passenger; D marks a destination; T marks a taxi; thicker lines mark walls. Five possible states from the all-passenger any-destination Taxi domain schema.

2.2 Deictic object-oriented representation

Our Deictic OO-MDP framework uses the schema state-space representation described in subsection 2.1, while deictic predicate preconditions are used to define the schema transition dynamics as described in this section. Let A be a set of actions. Then for each attribute C.α and action a ∈ A define a set of effects E_{a,C.α} = {e_i : Dom(C.α) → Dom(C.α)}^{K_{a,C.α}}_{i=1}. Define a set of deictic predicate preconditions F_{a,C.α} = {f_i : O[C] × S → B}^{D_{a,C.α}}_{i=1}, where O[C] is a set that contains objects with all possible attribute value assignments that are instances of C, and B = {0, 1}. Then the probabilistic transition dynamics for C.α and a are defined by P_{a,C.α} : B^{D_{a,C.α}} × E_{a,C.α} → [0, 1]. The schema transition dynamics P is the set of transition dynamics for all attributes and actions, P = {P_{a,C.α} | C ∈ C, C.α ∈ Att(C), a ∈ A}. The schema reward dynamics are defined by R : S × A × S → R.
Given an object set O we can instantiate a grounded MDP M_{O,ρ} = (S_O, A, P, R, γ, ρ) where γ is a discount rate and ρ is a distribution over initial states. Then if an agent is currently in state s ∈ S_O and takes action a, the transition dynamics for M_{O,ρ} operate as follows: for each object o in s that is an instance of C and each attribute C.α we compute the Boolean truth values B = {f_i(o, s)}^{D_{a,C.α}}_{i=1} for the deictic predicates in F_{a,C.α}. 
Then for an effect e ∈ E_{a,C.α} we compute P_{a,C.α}(B, e), which returns the probability of e occurring given B. This implies a distribution over effects, which in turn implies a distribution over the attribute values of o by applying the effect to o.α in s and obtaining e(o.α) = o.α′ in s′.
For example, consider attribute Taxi.x and action East for the all-passenger any-destination Taxi domain introduced in subsection 2.1. We can define a set of relative effects Rel_i(x) → x + i that produce a shift of i squares from the current location x, as well as a deictic predicate TE(taxi, s) that returns 1 if taxi has a wall one square to its east in s, otherwise 0. Then the transition dynamics can be described for any taxi object as TE(taxi, s) = 1 ⟹ taxi.x ← Rel_0(taxi.x) with probability 1 and TE(taxi, s) = 0 ⟹ taxi.x ← Rel_1(taxi.x) with probability 1. For a slightly more complex example consider attribute Passenger.in-taxi and action Pickup. The transition dynamics for this attribute depend on three preconditions for which we require a deictic passenger object: is a taxi on the same square as passenger? Is passenger.at-destination true? Is there any passenger in a taxi?
The key insight with Deictic OO-MDPs is that the parameters we pass to each precondition in F_{a,C.α} are a grounded deictic object o that must be an instance of C and s which is a state of the schema, not a grounded state. As a result these preconditions may not refer to specific objects in s; however, they may relate o to object classes of the schema. For example, with TE(taxi, s) as defined above only taxi is grounded while we never refer to a grounded object in s.
Given a resulting state s′, the reward dynamics operate by computing R(s, a, s′). For the purposes of this paper we will assume that R is known while P needs to be learned. 
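The Taxi.x / East dynamics just described can be written out directly. The predicate TE and the relative effects Rel_i come from the text; the state encoding and helper names are illustrative assumptions of this sketch.

```python
from collections import namedtuple

Taxi = namedtuple('Taxi', ['x', 'y'])
Wall = namedtuple('Wall', ['x', 'y'])

def make_rel(i):
    """Relative effect Rel_i(x) -> x + i from the text."""
    return lambda x: x + i

def TE(taxi, s):
    """Deictic predicate: 1 iff *some* wall is one square east of taxi in s.
    Only the deictic object `taxi` is grounded; we quantify over the Wall
    object class rather than naming any specific wall."""
    return int(any(w.x == taxi.x + 1 and w.y == taxi.y for w in s['walls']))

def east_dynamics(taxi, s):
    """Deterministic transition for the Taxi.x attribute under action East."""
    effect = make_rel(0) if TE(taxi, s) == 1 else make_rel(1)
    return effect(taxi.x)

def infer_effect(x, x_next):
    """The Rel_i effects are invertible: an observed pair (x, x') identifies
    a unique shift i, as the invertibility requirement below demands."""
    return x_next - x
```

Note how `TE` never takes a grounded state's wall objects as arguments; it only relates the deictic taxi to the Wall class, which is exactly what keeps the number of preconditions independent of the object set.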
A subtle but important requirement for our definition of P_{a,C.α} to be correct is that the set E_{a,C.α} must be invertible, so that if the current value assignment for some attribute C.α of a grounded object o in state s is o.α and we take action a to subsequently observe o.α′ in s′ then there must be a unique e ∈ E_{a,C.α} such that e(o.α) = o.α′. Clearly if this requirement is not met then either the effects are not able to correctly capture the true transition dynamics or there are duplicate effects that lead to the same o.α′, which creates ambiguity over which effect occurred.

2.3 Limitations of propositional object-oriented representation

It is beneficial to contrast the Deictic OO-MDP representation introduced in subsection 2.2 to Propositional OO-MDPs. Propositions alone are insufficient to represent the transition dynamics for the all-passenger any-destination Taxi domain. To see this, suppose we have two passenger objects and the proposition On(Taxi, Passenger) with truth value 1. This can be translated as: a taxi is on the same square as a passenger is true. Clearly this information is insufficient to determine which passenger's in-taxi attribute should change given the Pickup action. To overcome this we must resort to first-order predicates over grounded objects of the form On(Taxi, passenger1) and On(Taxi, passenger2). Note that the number of preconditions changes as we add more passenger objects, which complicates both learning and transfer procedures.
As a further example, consider the Sokoban domain where a person is required to push boxes to some storage locations. However, the person cannot push a box if that box is adjacent to another box or a wall. As discussed by the authors of Propositional OO-MDPs such domains are not suitable for propositional representations [3]. 
Consider the propositions TW(Box, Person) and TE(Box, Wall), both with truth value 1. The first term translates as: a box has a person one square to its west is true, while the second translates as: a box has a wall one square to its east is true. Then the conjunction TW(Box, Person) = 1 ∧ TE(Box, Wall) = 1 is insufficient to determine the transition dynamics of the box's x attribute when taking action East, since there is no way to know if the terms are referring to the same box. See Figure 2 for illustration. Deictic OO-MDPs are able to represent these conditions without ambiguity. Consider the Box.x attribute with action East. Define f1(box, s) to return 1 if box has a person one square to its west in s, otherwise 0. Define f2(box, s) to return 1 if box has a wall one square to its east in s, otherwise 0. Then f1(box, s) = 1 ∧ f2(box, s) = 1 ⟹ box.x ← box.x + 0.
In fact, representing the transition dynamics for this domain in terms of Deictic OO-MDPs is straightforward. For a deictic person object there are four preconditions when the action East is taken: is there a box one square east of person? Is there a box two squares east of person? Is there a wall one square east of person? Is there a wall two squares east of person? Meanwhile for a deictic box object there are three preconditions: is there a person one square west of box? Is there a wall one square east of box? Is there a box one square east of box? The other actions are analogous.

Figure 2: For action East and the box adjacent to the person, in (2a) the effect is box.x ← box.x + 0; in (2b) the effect is box.x ← box.x + 1, while the conjunction TW(Box, Person) = 1 ∧ TE(Box, Wall) = 1 is true in both cases. (2c) is a more complex Sokoban task with four boxes from the "Micro-Cosmos" level pack which requires 209 steps to solve under an optimal policy.

3 Learning transition dynamics

Given a set of D preconditions we want to learn the transition dynamics for each attribute C.α and action a. If the transition dynamics are deterministic this can be done using memorisation [10] with KWIK bound 2^D. However, this is prohibitive if D is large. Propositional OO-MDPs introduce a learning algorithm called DOORMAX [4] for deterministic transition dynamics that, under certain assumptions, has provably efficient KWIK bounds. Moreover, DOORMAX is able to learn from multiple hypothesised effect sets and determine the correct effect set for each attribute and action.
The main assumption required for DOORMAX to be correct is that the transition dynamics for each attribute and action must be representable as a full binary tree with propositions at the non-leaf nodes and effects at the leaf nodes. Furthermore, each possible effect of an effect set can occur at most at one leaf node of the tree, except for a special effect called a "failure condition" that may occur at multiple leaf nodes. A failure condition implies that globally no attribute changed when an action was taken, i.e. s = s′ when a is taken. See Figure 3a for how this is represented for the Taxi.x attribute with action East.
The intuition behind DOORMAX is that in many cases the number of preconditions that an effect depends on is much smaller than D. Furthermore, since an effect can occur at most once in the tree, we can invalidate many terms with a small number of observations by using conjunctions. 
For example, suppose we observe the following different sets of terms that each produce the same effect Taxi.x ← Taxi.x + 1: T1 = {TN(Taxi, Wall) = 0, TE(Taxi, Wall) = 0, TS(Taxi, Wall) = 0, TW(Taxi, Wall) = 0} and T2 = {TN(Taxi, Wall) = 1, TE(Taxi, Wall) = 0, TS(Taxi, Wall) = 1, TW(Taxi, Wall) = 1}. Then we can determine from only two observations that the relevant terms for this effect are the ones that occur in T1 ∧ T2, i.e. TE(Taxi, Wall) = 0. As failure conditions can occur at multiple leaf nodes they need to be learned on a case by case basis.
We adapt the DOORMAX algorithm to Deictic OO-MDPs, which we call DOORMAX_D. The main difference for DOORMAX_D is that we remove the notion of a global failure condition because this depends on a grounded state comprised of grounded objects. Meanwhile the transition dynamics of our representation are schema based to allow for transferability across tasks. Instead we require that all effects apply to a single attribute. See Figure 3b for how this is represented for the Taxi.x attribute with action East. To achieve this, we introduce a partition function over effects that groups them into those that can occur at most at one leaf node and those that can occur at multiple leaf nodes.

Figure 3: Full binary tree structure for the transition dynamics of the Taxi.x attribute and action East. (3a) for Propositional OO-MDPs; (3b) for Deictic OO-MDPs. Right branches represent a truth value of 1.

Figure 4: Binary trees induced by Te1 and Te2. Right branches represent a truth value of 1.

Let F be a set of deictic predicate preconditions and E be a set of effects.
Definition 1. A term is a tuple (f, b) where f ∈ F and b ∈ B. A set of terms is denoted by T and a set that contains sets of terms is denoted by 𝒯.
Definition 2. Term (f1, b1) mismatches term (f2, b2) if f1 = f2 and b1 ≠ b2.
Definition 3. 
Π : E → B is a binary partition function over E and assigns each effect in E to one of two partitions, 0 or 1. We call the tuple g = (E, Π) an effect type. Denote by K^g_0 and K^g_1 the number of effects in partition 0 and 1 respectively. We use . notation to refer to an element in a tuple g, so for example g.E refers to E in g.
Definition 4. Let g be an effect type. Let M > 1 be a constant. Then Tree(g, F, M) is the set of all full binary trees such that non-leaf nodes are a subset of F and leaf nodes are subsets of g.E. Furthermore, if g.Π assigns an effect in g.E to partition 1 then that effect can occur at most at one leaf node and we call that effect conjunctive; otherwise that effect may occur at most at M leaf nodes and we call that effect disjunctive.
Theorem 1. For attribute C.α and action a let F̂ be a set of size D that contains hypothesised deictic predicate preconditions on which the transition dynamics of C.α when taking action a may depend. Let Ĝ = {g_i}^N_{i=1} be a set of size N that contains hypothesised effect types where each e ∈ g_i.E has domain and range Dom(C.α). Let K_0 = max_{g∈Ĝ} K^g_0 and K_1 = max_{g∈Ĝ} K^g_1. Let H = {Tree(g, F̂, M) | g ∈ Ĝ} for some constant M > 1. Then if exactly one h* ∈ H is true, the transition dynamics for C.α and a can be learned with KWIK bound N(K_0 M + K_1(D + 1) + 1) + N − 1 by applying the learning procedure of Algorithm 1. The proof of Theorem 1 is in section 4 of the supplementary material and is analogous to the KWIK bound proof for Propositional OO-MDPs [4].
Note that DOORMAX_D, being an Rmax based algorithm, requires two procedures - one to learn the transition dynamics from observed data and one to make predictions for planning. 
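The conjunction step at the heart of DOORMAX-style learning (intersecting the term sets that produced the same conjunctive effect, as in the T1, T2 example in section 3) can be sketched as follows; the (predicate name, truth value) encoding of terms is our own assumption.

```python
def update_conjunction(stored_terms, observed_terms):
    """Keep only the terms consistent across all observations of a
    conjunctive effect: drop stored (predicate, value) pairs that
    mismatch the new observation (same predicate, different value)."""
    observed = dict(observed_terms)
    return {(f, b) for (f, b) in stored_terms
            if f not in observed or observed[f] == b}

# The two observations from the text, both producing Taxi.x <- Taxi.x + 1:
T1 = {('TN', 0), ('TE', 0), ('TS', 0), ('TW', 0)}
T2 = {('TN', 1), ('TE', 0), ('TS', 1), ('TW', 1)}
relevant = update_conjunction(T1, T2)
```

Two observations suffice here to isolate the single relevant precondition, TE = 0, mirroring how the algorithm shrinks a conjunctive effect's stored term set by removing mismatching terms.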
In this paper we present only the learning procedure (see Algorithm 1), leaving out the prediction procedure as it is analogous to that introduced for Propositional OO-MDPs [4]. For prediction, the essence of the procedure is that for an attribute C.α and action a the input is a deictic object o that is an instance of C and a state s, and it is required that for each effect type in Ĝ we can make a prediction that is not ⊥ and further that all the predictions of o.α′ are the same, otherwise return ⊥. The full procedure is included in section 3 of the supplementary material.

Algorithm 1: DOORMAX_D: learning procedure for C.α and a.
Input: o ∈ O[C], o.α′ ∈ Dom(C.α), s ∈ S
1  pass o and s to the deictic predicates in F̂ to retrieve a set of terms T
2  foreach g ∈ Ĝ do
3    foreach e ∈ g.E do
4      if 𝒯^g_e does not exist then
5        define 𝒯^g_e and initialise to the empty set, 𝒯^g_e ← φ
6      end
7      if e(o.α) = o.α′ then
8        if g.Π(e) = 0 then
9          add T to 𝒯^g_e
10         if (∃ e′ ∈ (g.E − {e}) and Te′ ∈ 𝒯^g_{e′} such that (T ⊆ Te′ or Te′ ⊆ T)) or (|𝒯^g_e| > M) then
11           remove g from Ĝ
12         end
13       else
14         if 𝒯^g_e = φ then
15           add T to 𝒯^g_e
16         else
17           Ttemp ← T
18           T ← the only element in 𝒯^g_e
19           remove from T any terms that mismatch terms in Ttemp
20           𝒯^g_e ← φ
21           add T to 𝒯^g_e
22         end
23         if (∃ e′ ∈ (g.E − {e}) and Te′ ∈ 𝒯^g_{e′}) such that (T ⊆ Te′ or Te′ ⊆ T) then
24           remove g from Ĝ
25         end
26       end
27     end
28   end
29 end

The key insight behind Algorithm 1 is that for all g ∈ Ĝ the set {𝒯^g_e}_{e∈g.E} must at all times induce a binary tree subject to the constraints of the partition 
function g.Π. Each time we observe a set of terms T and an associated effect e we update 𝒯^g_e and in doing so may discover that the resulting set {𝒯^g_e}_{e∈g.E} can no longer induce an appropriate binary tree, at which point we remove g from Ĝ. To provide some intuition on the operations in the algorithm consider a case where F = {f1, f2, f3} and there are two effects E = {e1, e2} where e1 is conjunctive and e2 is disjunctive. Consider the following examples. Figure 4a: we currently have 𝒯e1 = {Te1 = {(f1, 1), (f2, 1), (f3, 1)}} and 𝒯e2 = φ. Suppose we then observe e2 with T = {(f1, 1), (f2, 1), (f3, 1)}. Then 𝒯e2 is empty and Te1 ⊆ T. Figure 4b: we currently have 𝒯e1 = {Te1 = {(f1, 1), (f2, 1), (f3, 1)}} and 𝒯e2 = {Te2 = {(f1, 1), (f2, 1), (f3, 0)}}. Suppose we then observe e1 with T = {(f1, 1), (f2, 0), (f3, 0)}. As e1 is conjunctive and 𝒯e1 is not empty we first remove mismatching terms. Then 𝒯e1 = {T = {(f1, 1)}} and now T ⊆ Te2. Figure 4c: we currently have 𝒯e1 = {Te1 = {(f1, 1)}} and 𝒯e2 = {Te2 = {(f1, 0), (f2, 0), (f3, 0)}}. Suppose we then observe e2 with T = {(f1, 1), (f2, 0), (f3, 1)}. We add T to 𝒯e2 and now Te1 ⊆ T. In all the above cases, we conclude that the effect type is invalid since the observed data can no longer induce a binary tree subject to the specified constraints. Note that we do not place any restrictions on the order in which the preconditions may appear in the tree, but there is no reordering that can recover an appropriate binary tree given the data.

4 Experiments

4.1 All-passenger any-destination Taxi domain

We conduct two sets of experiments on this domain. In the first set we have one destination and we fix the number of passengers, n. 
We generate a grounded MDP with an initial state by randomly\nsampling n passenger locations and one destination location from one of six pre-speci\ufb01ed locations\nand we also sample a random taxi start location together with one of four wall con\ufb01gurations as\nshown in Figure 1a. We apply 20 independent runs of the following procedure: we sample 10 test\nMDPs with random initial states. We then randomly sample a training MDP and run DOORM AXD\non it for one episode until we reach the terminal state. Upon termination, we test performance by\nrunning DOORM AXD for one episode on each of the 10 test MDPs, stopping an episode early if\nwe exceed 500 steps. We repeat this for 100 training MDPs. Since all the MDPs come from the same\nschema we can share transition dynamics between our MDPs - but we only update the transition\ndynamics on training MDPs. In our experiments we start with n = 1 passenger and increase to\nn = 4 passengers. We run our experiments for Propositional OO-MDPs and two versions of Deictic\nOO-MDPs. In the \ufb01rst, without transfer, we relearn the transition dynamics for each n while for the\nsecond, with transfer, we transfer the previously learned transition dynamics each time we increase n.\nWe report results in Figure 5 that averages over the 20 independent runs the average number of steps\nfor the 10 test MDPs with error bars included.\n\n(a) One passenger experimental results.\n\n(b) Two passenger experimental results.\n\n(c) Three passenger experimental results.\n\n(d) Four passenger experimental results.\n\nFigure 5: Experimental results for the all-passenger any-destination Taxi domain with different\nnumber of passengers.\n\nWe see from the results that for this domain deictic representations outperform the propositional\nrepresentation as we increase the number of passengers. 
This is because with the propositional\nrepresentation we need to add more preconditions as we increase the number of passengers - in\nfact the propositional representation is unable to learn the task when n = 4 even after 100 training\nepisodes. Furthermore, as the MDPs belong to the same schema it is bene\ufb01cial to transfer the previous\ntransition dynamics under the deictic representation. Speci\ufb01cally, once we get to n = 4 passengers\nthe transition dynamics we transfer from the n = 3 experiment have learned the schema transition\ndynamics completely and we have zero-shot transfer of the transition dynamics.\n\n7\n\n\fWe observe from Figure 5 that using the deictic representation without transfer is actually learning\nslightly faster as we add more passengers. This is somewhat misleading. What is actually happening\nis that as the tasks become more complex the agent is able to learn more about the transition dynamics\nover a single training episode, but that episode will require many more steps to complete. To illustrate\nthis and also highlight the robustness of the deictic representation with transfer methodology we\nconduct a second set of experiments. These experiments are similar to those conducted before but we\nnow use a larger 10 \u00d7 10 gridworld with \ufb01ve passengers and three destinations as in Figure 1b. In\nthese experiments we stop after 100 steps for each episode of the training MDPs. Furthermore, for\nthe deictic representation with transfer we simply transfer the learned transition dynamics of n = 4\npassengers and do no additional learning on the new larger gridworld.\nIn Figure 6 we plot for all the experiments run the average number of steps relative to optimal number\nof steps. We see that the deictic representation with transfer is able to solve the larger gridworld\noptimally with no additional learning of the transition dynamics. 
Meanwhile, the deictic representation with no transfer, which was decreasing up to n = 4, now has a jump between n = 4 and n = 5 because the agent does not have the benefit of learning for a full training episode. We cap the graph's y axis at 10 to make it more readable, but remark that the propositional representation exhibits exponentially worse performance relative to optimal as the tasks become more complex. We include additional details on the transition dynamics for this domain in section 1 of the supplementary material.

Figure 6: Average number of steps relative to optimal number of steps as we add more passengers - for n = 5 we also increase the gridworld size and add more destinations hence it is marked with a *.

4.2 Sokoban domain

To demonstrate the benefits of Deictic OO-MDPs, we conduct an experiment on a more challenging Sokoban domain. In this experiment we first learn the transition dynamics on an MDP with initial state as in Figure 2a and continue learning until we have a prediction for every state and action in the MDP. As it turns out, this simple toy MDP with 7,961 states is enough to completely learn the schema transition dynamics of this domain under the Deictic OO-MDP representation. Once learned we zero-shot transfer the transition dynamics to a more complex Sokoban task as shown in Figure 2c. This task comes from the "Micro-Cosmos" level pack and has approximately 10^6 states, while the optimal number of steps to solve this task is 209. With no additional learning we run value iteration and solve for the optimal policy. Note that the ability to transfer here is critical. The larger MDP has approximately 125 times more states than the toy MDP. 
Running R-max-based algorithms directly on the larger MDP is very slow, because at each step it is required to compute a policy with a planning algorithm such as value iteration, which requires multiple iterations over the state-space to converge. By transferring the transition dynamics learned in the toy MDP we can solve the larger MDP with only a single run of value iteration, which is extremely efficient. We include additional details on the transition dynamics for this domain in section 2 of the supplementary material.

5 Concluding remarks

In this paper we have introduced a novel deictic object-oriented representation for RL. We have shown that this representation can be described in terms of a schema that allows reuse of transition dynamics across grounded MDPs instantiated from that schema, and that this allows for zero-shot transfer across such MDPs. Theoretically, we have proved that under certain assumptions we can efficiently learn deterministic transition dynamics. We conducted experiments on an extension of the Taxi domain as well as a more challenging Sokoban domain, and have shown that zero-shot transfer of transition dynamics is possible with our representation.

Acknowledgments

The authors wish to thank Google Travel and Conference Grants for their support. The authors also wish to thank the anonymous reviewers for their thorough feedback and helpful comments.

References

[1] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

[2] T. G. Dietterich. The MAXQ method for hierarchical reinforcement learning. Proceedings of the 15th International Conference on Machine Learning, 1998.

[3] C. Diuk. An object-oriented representation for efficient reinforcement learning. PhD thesis, Rutgers, The State University of New Jersey, New Brunswick, 2010.

[4] C. Diuk, A. Cohen, and M. L.
Littman. An object-oriented representation for efficient reinforcement learning. Proceedings of the 25th International Conference on Machine Learning, pages 240–247, 2008.

[5] S. Džeroski, L. De Raedt, and K. Driessens. Relational reinforcement learning. Machine Learning Journal, 43:7–52, 2001.

[6] S. Finney, N. Gardiol, L. P. Kaelbling, and T. Oates. Learning with deictic representations. Technical report, Massachusetts Institute of Technology, 2002.

[7] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia. Generalizing plans to new environments in relational MDPs. Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 1003–1010, 2003.

[8] K. Kansky, T. Silver, D. A. Mély, M. Eldawy, M. Lázaro-Gredilla, X. Lou, N. Dorfman, S. Sidor, S. Phoenix, and D. George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. Proceedings of the 34th International Conference on Machine Learning, pages 1809–1818, 2017.

[9] G. Konidaris and A. Barto. Building portable options: Skill transfer in reinforcement learning. Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pages 895–900, 2007.

[10] L. Li, M. L. Littman, and T. J. Walsh. Knows what it knows: A framework for self-aware learning. Proceedings of the 25th International Conference on Machine Learning, pages 568–575, 2008.

[11] B. Ravindran and A. Barto. Relativized options: Choosing the right transformation. Proceedings of the 12th Yale Workshop on Adaptive and Learning Systems, pages 109–114, 2003.

[12] J. Scholz, M. Levihn, C. L. Isbell, and D. Wingate. A physics-based model prior for object-oriented MDPs. Proceedings of the 31st International Conference on Machine Learning, pages 1089–1097, 2014.

[13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[14] M. E. Taylor and P.
Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(1):1633–1685, 2009.