{"title": "Symmetry-Based Disentangled Representation Learning requires Interaction with Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 4606, "page_last": 4615, "abstract": "Finding a generally accepted formal definition of a disentangled representation in the context of an agent behaving in an environment is an important challenge towards the construction of data-efficient autonomous agents. Higgins et al. recently proposed Symmetry-Based Disentangled Representation Learning, a definition based on a characterization of symmetries in the environment using group theory. We build on their work and make observations, theoretical and empirical, that lead us to argue that Symmetry-Based Disentangled Representation Learning cannot only be based on static observations: agents should interact with the environment to discover its symmetries. Our experiments can be reproduced in Colab and the code is available on GitHub.", "full_text": "Symmetry-Based Disentangled Representation\nLearning requires Interaction with Environments\n\nHugo Caselles-Dupr\u00e91,2, Michael Garcia-Ortiz2, David Filliat1\n\n1Flowers Laboratory (ENSTA Paris & INRIA), 2AI Lab (Softbank Robotics Europe)\n\ncaselles@ensta.fr, mgarciaortiz@softbankrobotics.com, david.\ufb01lliat@ensta.fr\n\nAbstract\n\nFinding a generally accepted formal de\ufb01nition of a disentangled representation\nin the context of an agent behaving in an environment is an important challenge\ntowards the construction of data-ef\ufb01cient autonomous agents. Higgins et al. (2018)\nrecently proposed Symmetry-Based Disentangled Representation Learning, a de\ufb01-\nnition based on a characterization of symmetries in the environment using group\ntheory. 
We build on their work and make observations, theoretical and empirical, that lead us to argue that Symmetry-Based Disentangled Representation Learning cannot be based only on static observations: agents should interact with the environment to discover its symmetries. Our experiments can be reproduced in Colab 1 and the code is available on GitHub 2.

1 Introduction

Disentangled Representation Learning aims at finding a low-dimensional vector representation of the world in which the underlying structure of the world is separated into disjoint parts (i.e., disentangled), reflecting the compositional nature of that world. Previous work (Higgins et al., 2017; Raffin et al., 2019) has shown that agents capable of learning disentangled representations can perform data-efficient policy learning. However, there is no generally accepted formal definition of disentanglement in Representation Learning, which prevents significant progress in this emerging field.
Recent efforts have been made towards finding a proper definition (Locatello et al., 2019b). In particular, Higgins et al. (2018) define Symmetry-Based Disentangled Representation Learning (SBDRL), taking inspiration from the successful study of symmetry transformations in Physics. Their definition focuses on the transformation properties of the world. They argue that transformations that change only some properties of the underlying world state, while leaving all other properties invariant, are what gives exploitable structure to any kind of data. They distinguish between linear and non-linear disentangled representations, which model whether the transformation affects the representation in a linear or non-linear way. Supposedly, linearity should be more useful for downstream tasks such as Reinforcement Learning or auxiliary prediction tasks. 
Their de\ufb01nition is intuitive and\nprovides principled resolutions to several points of contention regarding what disentanglement is. For\nclarity, we refer to a representation as SB-disentangled if it is disentangled in the sense of SBDRL,\nand as LSB-disentangled if linear disentangled.\nWe build on the work of Higgins et al. (2018) and make observations, theoretical and empirical,\nthat lead us to argue that SBDRL requires interaction with environments. The necessity of having\ninteraction has been suggested before (Thomas et al., 2017). We propose a proof for SBDRL.\n\n1https://colab.research.google.com/drive/1KVlSV24c687N_4TLJWwGTkjt3sh9ufWW\n2https://github.com/Caselles/NeurIPS19-SBDRL\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fAs in the original work, we base our experiments on a simple environment, where we can formally\nde\ufb01ne and manipulate a SB-disentangled representation. This simple environment is 2D, composed\nof one circular agent on a plane that can move left-right and up-down. The world is cyclic: whenever\nthe agent steps beyond the boundary of the world, it is placed at the opposite end (e.g. stepping up at\nthe top of the grid places the object at the bottom of the grid).\nWe prove, for this environment, that the minimal number of dimensions of the representation required\nfor it to be LSB-disentangled is counter-intuitive (i.e. 4). Indeed, the natural number of dimensions\nrequired to describe the state of the world (i.e. 2) is not enough to describe its symmetries in a\nlinear way, which is supposedly ideal for subsequent tasks. Additionally, learning a non-linear\nSB-disentangled representation is possible, but current approaches are not designed to model the\neffect of the world\u2019s symmetries on the representation, a key aspect of SBDRL which we present\nlater. 
We thus ask: how is one supposed to, in practice, learn a (L)SB-disentangled representation? We propose two approaches that arise naturally: one where the representation and the effect of the world's symmetries on it are learned separately, and one where they are learned jointly. For both scenarios, we formally define what a proper representation to learn could be, using the formalism of SBDRL. We propose empirical implementations that successfully approximate these analytically defined representations. Both empirical approaches make use of transitions (ot, at, ot+1) rather than still observations ot, which validates the main point of this paper: Symmetry-Based Disentangled Representation Learning requires interaction with the environment.
Ultimately, the goal of such representations is to facilitate the learning of downstream tasks. We study the efficiency of (L)SB-disentangled representations on a particular downstream task: learning an inverse model. Our results suggest that (L)SB-disentangled representations indeed facilitate the learning of such a downstream task.

Figure 1: Left: Environment studied in this paper. Right: Proposed architecture for learning a LSB-disentangled representation in the environment on the left, as presented in Section 6.2.

Our contributions are therefore the following:

• We prove that interaction with the environment, i.e. the use of transitions, is necessary for SBDRL, and illustrate it empirically.

• We propose two alternatives for learning linear and non-linear SB-disentangled representations in practice, both using transitions rather than still observations. Using a simple environment, we describe both solutions theoretically and validate them empirically.

• We empirically demonstrate the efficiency of using SB-disentangled representations for a downstream task (learning an inverse model).

2 Symmetry-Based Disentangled Representation Learning

Higgins et al. 
(2018) define Symmetry-Based Disentangled Representation Learning (SBDRL) in an attempt to formalize disentanglement in Representation Learning. The core idea is that SB-disentanglement of a representation is defined with respect to a particular decomposition of the symmetries of the environment. Symmetries are transformations of the environment that leave some aspects of it unchanged. For instance, for an agent on a plane, translations of the agent on the y-axis leave its x coordinate unchanged. They formalize this using group theory. Groups are composed of these transformations, and group actions are the effect of the transformations on the state of the world and on the representation.
The proposed definition of SB-disentanglement supposes that these symmetries are formally defined as a group G (equipped with composition) that can be decomposed into a direct product G = G1 × .. × Gn. We now recall the formal definition of a SB-disentangled representation w.r.t. this group decomposition. We advise the reader to refer to the detailed work of Higgins et al. (2018) for any clarification. Let W be a set of world-states W = (w1, .., wm) ∈ Rm×d, where each state wi is a d-dimensional vector. We suppose that there is a generative process b : W → O leading from world-states to observations (these could be pixel, retinal, or any other potentially multi-sensory observations), and an inference process h : O → Z leading from observations to an agent's representations. We consider the composition f : W → Z, f = h ∘ b. Suppose also that there is a group G of symmetries acting on W via a group action ·W : G × W → W. A world is thus defined by (W, ·W). We would like to find a corresponding group action ·Z : G × Z → Z so that the symmetry structure of W is reflected in Z. 
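This setup — world-states, a symmetry group acting on them, and a map into a representation space with a matching group action — can be illustrated with a small sketch (our own toy code, not the paper's; the choice W = Z_N × Z_N and all names are ours):

```python
# Toy illustration: world-states W = Z_N x Z_N, the symmetry group
# G = G_x x G_y acting by cyclic translations, and a candidate
# representation f with a corresponding group action on Z.
N = 4

def act_W(g, w):
    """Group action .W : G x W -> W; g = (gx, gy) are integer translations."""
    (gx, gy), (x, y) = g, w
    return ((x + gx) % N, (y + gy) % N)

def f(w):
    """Candidate map f : W -> Z (here simply positions rescaled to [0, 1))."""
    x, y = w
    return (x / N, y / N)

def act_Z(g, z):
    """Corresponding group action .Z : G x Z -> Z."""
    (gx, gy), (zx, zy) = g, z
    return (((zx * N + gx) % N) / N, ((zy * N + gy) % N) / N)

# Equivariance check: f(g .W w) == g .Z f(w) for every g and w.
worlds = [(x, y) for x in range(N) for y in range(N)]
group = [(gx, gy) for gx in range(N) for gy in range(N)]
assert all(f(act_W(g, w)) == act_Z(g, f(w)) for g in group for w in worlds)
```

Here the symmetry structure of W is reflected in Z exactly because the square of maps commutes for every group element.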
We also want the group action ·Z to be disentangled, which means that applying Gi to Z leaves all sub-spaces of Z unchanged except the one corresponding to the transformation Gi. Formally, the representation Z is SB-disentangled with respect to the decomposition G = G1 × .. × Gn if:

1. There is a group action ·Z : G × Z → Z.

2. The map f : W → Z is equivariant between the group actions on W and Z: f(g ·W w) = g ·Z f(w) for all g ∈ G, w ∈ W.

3. There is a decomposition Z = Z1 × .. × Zn such that each Zi is fixed by the action of all Gj, j ≠ i and affected only by Gi.

This definition of SB-disentangled representations does not make any assumptions on what form the group action should take when acting on the relevant disentangled vector subspace. However, many subsequent tasks may benefit from a SB-disentangled representation where the group actions transform their corresponding disentangled subspace linearly. Such representations are termed linear SB-disentangled representations, which we refer to as LSB-disentangled representations.

3 Symmetry-Based Disentangled Representation Learning requires interaction with environments

In this section we prove the main claim of this paper: SBDRL requires interaction with environments. By "interaction with environments" we refer to the fact that in order to learn a SB-disentangled representation, one should not use a training set composed of still samples (ot, ot+1, ...), but rather transitions ((ot, at, ot+1), (ot+1, at+1, ot+2), ...).
We begin by observing that the SBDRL definition is actually two-fold. The definition of a SB-disentangled representation w.r.t. the decomposition G = G1 × .. × Gn is composed of two main properties, the first two points below defining a Symmetry-Based representation and the third the disentanglement property:

1. There is a group action ·Z : G × Z → Z.
2. 
The map f : W → Z is equivariant between the group actions on W and Z.

3. There is a decomposition Z = Z1 × .. × Zn such that each Zi is fixed by the action of all Gj, j ≠ i and affected only by Gi.

The first two points define what a SB representation is: a representation for which the effect of the group's symmetries on the world state is mirrored by their effect on the representation itself. The third point characterizes what disentanglement means for a SB representation.

In practice, it seems natural to first know how to learn a representation that satisfies the first two points, i.e. a SB representation. Based on this, we can develop methods that enforce disentanglement.
Hence we ask: how can one learn a SB representation? This task involves knowledge of how the group action affects Z. The group action is defined as the effect of symmetries on the representation. These symmetries can be translations, rotations, time translations, etc. In a Machine Learning paradigm, we would design an algorithm that learns from examples. We thus need, in practice, a way to apply these transformations to observations of the world (ot)t=1..n and observe the result (gt ·Z ot = ot+1)t=1..n.
We thus make an analogy between the effect of a symmetry g (through the group action ·W) on the environment (o1, g, g ·W o1 = o2) and a transition that uses the dynamics f of the environment (ot, at, f(ot, at) = ot+1). It allows us to consider a more realistic scenario where we have an agent in an environment, and we can apply the group actions to this agent. In our analogy we simply identify o1 = ot, o2 = ot+1, at = g and ·W = f.
However, we do not conflate symmetries with the ordinary actions that can be found in any environment. 
A symmetry is an element of a group (in the mathematical sense) of functions g : W → W, where the binary operation of the group is composition. In that sense, these functions can effectively be considered as actions: actions take the environment from one state to another through the dynamics f, and symmetries take the environment from one state to another through the group action ·W.
It is important to mention that not all actions are symmetries. For instance, the action of eating a collectible item in the environment is not part of any group of symmetries of the environment because it might be irreversible.
More formally, Theorem 1 provides a mathematical proof that we need interaction with environments.

Theorem 1. Suppose we have a SB representation (f, ·Z) of a world W0 = (W = (w1, .., wm) ∈ Rm×d, ·W0) w.r.t. G = G1 × ... × Gn, learned from a training set T of unordered observations of W0. Let Wk be the set of possible values for the kth dimension of w ∈ W. Then:

1. There exist at least kW,G = n · [(mink card(Wk))!] − 1 worlds (W1, .., WkW,G) equipped with the same world states Wi = (w1, .., wm) and symmetries G, but different group actions ·Wi.

2. For these worlds, (f, ·Z) is not a SB representation.

3. These worlds can produce exactly the same training set T of still images.

Proof. Consider two identical worlds (W1, W2) equipped with the same symmetries G. Suppose that they are given two different group actions ·1,W and ·2,W, i.e. the effect of G on W1 is not the same as its effect on W2. Then, given a fixed observation o, it is impossible to tell whether o is an observation of W1 or W2. It is only possible to tell if we have access to transitions (ot, gt, ot+1)t=1..n and observe the result (gt ·Z zt = zt+1)t=1..n. 
See Appendix A.1 for the full proof.

Using Theorem 1, we can deduce that from a given dataset of still images collected in a world, it is impossible to describe the action of the symmetries on that world. The dataset could come from a number of different worlds in which the symmetries act differently. Hence the need for transitions. For example, in a world where the agent can change color along a hue axis, the succession of colors can be (red, green, blue, red, ...) or (red, blue, green, red, ...). The world states are identical, and so are the symmetries; yet the effect of the symmetries is not the same, i.e. ·1,W ≠ ·2,W.
Still, it is not clear how to discover the symmetries G of a world. Higgins et al. (2018) propose to use active perception or causal manipulations of the world to empirically determine them. Having this in mind, we note that high-level actions in an environment often correspond to symmetries, such as translations along Cartesian axes, rotations, changes of color, or changes related to time (no-op action). Actions could then be used as replacements for symmetries, and one could learn SB representations using traditional transitions (ot, at, ot+1)t=1..n that are readily available in most environments. In the rest of the paper, we validate this approach empirically.

4 Considered environment

In this paper, we consider a simplification of the environment studied in (Higgins et al., 2018). This environment is 2D, composed of one circular agent on a plane that can move left-right and up-down, see Fig.1. Whenever the agent steps beyond the boundary of the world, it is placed at the opposite end (e.g. stepping up at the top of the grid places the object at the bottom of the grid). The world-states can be described in two dimensions: the (x, y) position of the agent. All of our experimental results are based on this environment. It is simple, yet presents the basis for a navigation environment in 2D. 
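The dynamics of this environment can be sketched in a few lines (our own illustrative code; the actual implementation uses Flatland, and the grid size N here is an arbitrary choice):

```python
# A minimal sketch of the environment: a single agent on a cyclic 2D grid,
# moving left/right/up/down with wrap-around at the boundaries.
N = 8  # grid size (our choice)
ACTIONS = {"left": (-1, 0), "right": (1, 0), "down": (0, -1), "up": (0, 1)}

def step(state, action):
    """World dynamics: translate the agent, wrapping cyclically."""
    (x, y), (dx, dy) = state, ACTIONS[action]
    return ((x + dx) % N, (y + dy) % N)

# Stepping up at the top wraps the agent to the bottom:
assert step((3, N - 1), "up") == (3, 0)
# Each action is invertible, as an element of a symmetry group must be:
assert step(step((2, 5), "right"), "left") == (2, 5)
```

Note that each action here is a translation of the cyclic group Z_N acting on one coordinate, which is exactly the sense in which actions stand in for symmetries above.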
We chose this environment because we are able to theoretically define SB-disentangled representations for it, without making any approximation. We implement this simple environment using Flatland (Caselles-Dupré et al., 2018). The code is available in Colab3 and Github4. All architecture and hyperparameter details are specified in Appendix B.

5 Theoretical analysis

We first provide a theoretical analysis of what can be learned in the considered environment, in the formalism of SBDRL. Learning a non-linear SB-disentangled representation of dimension 2 is possible: if (x, y) is the position of the object, then learning these two coordinates as well as the cyclical effect of translations is enough to create a SB-disentangled representation of dimension 2.
However, this is not the case for LSB-disentangled representations. We provide a theorem that proves it is impossible to learn a LSB-disentangled representation of dimension 2 in the environment presented in Sec.4 (the result also applies to the environment considered in Higgins et al. (2018)). The key element of the proof is that the two actual dimensions of the environment are not linear but cyclic; hence the impossibility of modelling two cyclic dimensions using two linear dimensions. See Appendix A.2 for the full proof of the result.
Based on this, we show next how to learn, in practice, a SB-disentangled representation of dimension 2 and a LSB-disentangled representation of dimension 4.

6 Symmetry-Based Disentangled Representation Learning in practice

Figure 2: Left: First option: decoupled learning of representation and group action, here applied to learning a non-linear SB-disentangled representation. Latent traversal spanning from -2 to 2 over each of the representation's dimensions, followed by the predicted effect of the group action associated with each action (left, right, down, up). 
Right: Second option: joint learning of representation and group action, here applied to learning a LSB-disentangled representation. The representation is complex-valued: latent traversal over the phase of each of the representation's dimensions, followed by the predicted linear effect of the group action associated with each action (down, left, up, right).

We consider the problem of learning, in practice, SB-disentangled and LSB-disentangled representations for the world considered in Sec.4. To this end, we propose two approaches: decoupled and end-to-end.

3https://colab.research.google.com/drive/1KVlSV24c687N_4TLJWwGTkjt3sh9ufWW
4https://github.com/Caselles/NeurIPS19-SBDRL

We illustrate each method by learning a SB-disentangled representation with the decoupled approach, and a LSB-disentangled representation with the end-to-end approach.

6.1 Decoupled approach (illustrated on SB-disentangled representation)

We propose to learn the representation first, and then the group action of G on Z using a separate model. This way, we have a complete description of the SB-disentangled representation. This approach effectively decouples the learning of physics from vision, as in (Ha and Schmidhuber, 2018).
We consider learning a 2-dimensional SB-disentangled representation. We started by reproducing the results in (Higgins et al., 2018): we used a variant of the current state-of-the-art disentangled representation learning model, CCI-VAE. The learned representation corresponds (up to a scaling factor) to the world-state W, i.e. the (x, y) position of the agent. This intuitively seems like a reasonable approximation to a disentangled representation.
However, once the representation is learned, we have no idea how the group action of the symmetries affects the representation, even though it is at the core of the definition of SBDRL. 
This is where the necessity for transitions (ot, at, ot+1)t=1..n rather than still observations (ot)t=1..n comes into play. We learn the group action ·Z : G × Z → Z on Z, such that f = h ∘ b is an equivariant map between the actions on W and Z.
In practice, we learn h : O → Z with a variant of CCI-VAE, and then use a multi-layer perceptron to learn the group action on Z. The results are presented in Fig.2, where we observe that the learned group action correctly approximates the cyclical movement of the agent. We have thus learned a properly SB-disentangled representation of the world, w.r.t. the group decomposition G = Gx × Gy.

6.2 End-to-end approach (illustrated on LSB-disentangled representation)

In the decoupled approach, the learned representation is identical to one obtained in a setting where we would have ignored the group action. Hence, a preferable approach would be to jointly learn the representation and the group action. We study such an approach on the task of learning a LSB-disentangled representation.
To accomplish this, we start with a theoretically constructed LSB-disentangled representation, based on an example given in Higgins et al. (2018). The representation is defined as follows, using 4 dimensions:

• f : R2 → C2 is defined as f(x, y) = (e^(2iπx/N), e^(2iπy/N))

• ρ(g) : C2 → C2 is defined as ρ(gx)(zx, zy) = (e^(2iπnx/N) zx, zy) and ρ(gy)(zx, zy) = (zx, e^(2iπny/N) zy)

In this representation, the (x, y) position is mapped to two complex numbers (zx, zy). For each translation (on the x-axis or y-axis), the associated group action on Z is a rotation of the complex plane associated with the specific axis. 
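This construction can be checked numerically with a short sketch (our own code; the value of N and the sample positions are arbitrary choices of ours):

```python
# The analytically constructed LSB-disentangled representation: positions map
# to unit complex numbers, and each translation acts *linearly* on Z as a
# rotation of the corresponding complex plane.
import cmath

N = 8  # number of positions along each axis (our choice)

def f(x, y):
    """f : R^2 -> C^2, f(x, y) = (e^{2i*pi*x/N}, e^{2i*pi*y/N})."""
    return (cmath.exp(2j * cmath.pi * x / N), cmath.exp(2j * cmath.pi * y / N))

def rho_gx(n, z):
    """Action of an x-translation by n steps: rotate the first plane only."""
    zx, zy = z
    return (cmath.exp(2j * cmath.pi * n / N) * zx, zy)

# Equivariance: translating in W then encoding == encoding then rotating in Z.
x, y, n = 3, 5, 2
za = f((x + n) % N, y)
zb = rho_gx(n, f(x, y))
assert abs(za[0] - zb[0]) < 1e-12 and abs(za[1] - zb[1]) < 1e-12

# The cyclicity comes for free: a full loop of N steps is the identity.
zc = rho_gx(N, f(x, y))
assert abs(zc[0] - f(x, y)[0]) < 1e-12
```

The rotation is a linear map on each complex plane, which is what makes this representation LSB rather than merely SB-disentangled.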
This representation linearly accounts for the cyclic symmetry present in the environment.
Using CCI-VAE with 4 dimensions fails to learn this representation: we verified experimentally that only 2 dimensions were actually used during learning (for encoding the (x, y) position), while the two remaining ones were ignored.
In order to learn the LSB-disentangled representation, we generate a dataset of transitions and use it to learn the 4-dimensional LSB-disentangled representation with a specific VAE architecture we term Forward-VAE. This architecture allows us to jointly learn the representation and the group action on it. Here, we want the group action on Z to be linear, so we enforce linearity of the transitions in the representation space.
We begin by re-writing the complex-valued function ρ(g) : C2 → C2 as a real-valued function:

ρ(g) : R4 → R4, v → ρ(g)(v) = A∗(g) · v    (1)

where A∗(g) is a 4x4 block-diagonal matrix composed of 2x2 rotation matrices. Consider the environment of Sec.4: the agent has 4 actions (go left, right, up or down), and we associate each action with a corresponding matrix with trainable weights.
For instance, if g = gx ∈ Gx is a translation on the x-axis, the corresponding ideal matrix is A∗(gx), and we associate the actions go right/left with corresponding matrices Â(at), where each · denotes a trainable parameter:

A∗(gx) =
[ cos(nx)  −sin(nx)  0  0 ]
[ sin(nx)   cos(nx)  0  0 ]
[    0         0     1  0 ]
[    0         0     0  1 ]

and

Â(gx) =
[ ·  ·  0  0 ]
[ ·  ·  0  0 ]
[ 0  0  1  0 ]
[ 0  0  0  1 ].

We would like the representation model that we learn to satisfy ρ(g)(vt) = Â(g) · vt = vt+1. 
We thus enforce the representation to satisfy it in our Forward-VAE architecture, as illustrated in Fig.1. The training procedure is presented in Algorithm 1 in Appendix C.2. For each image in a batch, we compute f(ot) = zt and f(ot+1) = zt+1 using the encoder part of the VAE. We then decode zt with the decoder and compute the reconstruction loss Lreconstruction and the annealed KL divergence LKL as in (Caselles-Dupré et al., 2019). We then compute Â(at) · zt and the forward loss, which is the MSE with zt+1: Lforward = (Â(at) · zt − zt+1)2. We then backpropagate w.r.t. the full loss function of Forward-VAE:

LForward−VAE = Lreconstruction + γt · LKL + Lforward    (2)

The results are presented in Fig.2 and Appendix D. Forward-VAE correctly learns a representation where the two complex dimensions correspond to the position (x, y) of the agent. Moreover, we observe that the learned matrices (Âi)i=1..4 are very good approximations of the ideal matrices (A∗i)i=1..4 defined above, with nx ≈ π/3. The mean squared difference is very small (of the order of 10−4).

6.3 Remarks

Note that we could also have applied this joint learning approach to learning a non-linear SB-disentangled representation. However, it is not possible to apply the decoupled approach to learning a LSB-disentangled representation.
We used the inductive bias given by the theoretical construction of a LSB-disentangled representation to design the action matrices and their trainable weights. This construction is specific to this example. However, the idea of having an action matrix for each action is extendable: if each action is high-level and associated with a symmetry, then SBDRL can be performed. Still, it requires high-level actions that represent these symmetries. 
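The forward part of the objective can be sketched numerically (our own illustrative code, not the released implementation; in the real model Â(a) is trained jointly with the VAE encoder and decoder):

```python
# Schematic sketch of the Forward-VAE forward loss: each action a has a 4x4
# matrix A_hat(a), and L_forward is the MSE between A_hat(a) @ z_t and z_{t+1}.
# The full loss adds reconstruction and annealed KL terms (Eq. 2).
import numpy as np

def rotation_block(theta):
    """2x2 rotation acting on one complex plane of Z = R^4."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def A_star_x(theta):
    """Ideal A*(g_x): block-diagonal, rotate the x-plane, fix the y-plane."""
    A = np.eye(4)
    A[:2, :2] = rotation_block(theta)
    return A

def forward_loss(A_hat, z_t, z_next):
    """L_forward = mean squared error between A_hat @ z_t and z_{t+1}."""
    return float(np.mean((A_hat @ z_t - z_next) ** 2))

# With the ideal matrix and a consistent transition, the loss vanishes:
theta = np.pi / 3                     # the angle recovered in practice
z_t = np.array([1.0, 0.0, 0.0, 1.0])  # (z_x, z_y) stored as (Re, Im, Re, Im)
z_next = A_star_x(theta) @ z_t        # ideal effect of an x-translation
assert forward_loss(A_star_x(theta), z_t, z_next) < 1e-12
```

Gradient descent on L_forward drives the trainable entries of Â(a) toward such a rotation block, while the VAE terms keep the latent space usable for reconstruction.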
One potential way to find these actions is through active search (Soatto, 2011), as suggested in (Higgins et al., 2018).
In our Forward-VAE architecture, we explicitly design the model so that the resulting representation is linear SB-disentangled: we enforce linearity, force the representation to be SB (see points 1 and 2 of the definition in Sec.3), and by design have two separate subspaces for each symmetry. A more general approach would have been not to impose those two separate subspaces and to learn the entire action matrices; we would then have no guarantee that the representation satisfies the disentanglement property. We ran this experiment and obtained the expected result: the learned representation is linear SB but not disentangled. This means that the x and y coordinates are not properly disentangled w.r.t. the considered group decomposition (i.e. a latent traversal over each dimension does not result in a movement of the agent along only the x or y coordinate). However, the learned action matrices are able to describe how the symmetries affect the representation, and in a linear way. Hence, enforcing disentanglement is the only viable option we found for LSB-disentanglement with this architecture.
It is important to note that some instability in Forward-VAE training can be expected due to the different contributions of the loss: at each training step, the goal of the forward part of the loss is to obtain a latent space suited to predicting zt+1 from zt, while the rest of the loss is the VAE objective, which tries to learn a latent space that allows reconstruction. The balance between these two seemingly unrelated objectives might thus be a source of instability. However, it worked in practice, without any re-weighting of the objectives, which came as a surprise.

7 Using (L)SB-disentangled representations for downstream tasks

Is using (L)SB-disentangled representations beneficial for subsequent tasks? 
This remains to be demonstrated, as other works have already challenged the benefit of learning disentangled representations over non-disentangled ones (Locatello et al., 2019b). In this section we wish to answer the following question: is it increasingly better to use a non-disentangled / non-linear SB-disentangled / LSB-disentangled representation for downstream tasks? We define "better" in terms of final performance, under different settings (restricted-capacity classifiers / restricted amounts of data). As the downstream task, we select learning an inverse model, which consists in predicting the action at from two consecutive states (st, st+1).
As a LSB-disentangled representation models the interaction with the environment linearly, it intuitively should be increasingly easier to learn an inverse model from: a non-disentangled representation, a non-linear SB-disentangled representation, and a LSB-disentangled representation.

7.1 Experimental protocol

In order to test this hypothesis, we selected a well-established implementation (Scikit-learn (Pedregosa et al., 2011)) of a well-studied classifier (Random Forest (Breiman, 2001)). We collect 10k transitions (ot, at, ot+1) and train the following models and baselines for comparison:

• LSB-disentangled representation of dimension 4: Forward-VAE trained as in Sec.6.2.

• SB-disentangled representation of dimension 2: CCI-VAE variant trained as in Sec.6.1.

• Non-disentangled representation of dimension 2: Auto-encoder, non-disentangled baseline.

• SB-disentangled representation of dimension 4: CCI-VAE trained as in Sec.6.1 but with 4 dimensions, a baseline to control for the effect of the size of the representation.

For each model, once trained, we created a dataset of transitions in the corresponding representation space (st, at, st+1). 
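The evaluation protocol can be sketched as follows (our own code; for illustration we use synthetic grid states as a stand-in for the learned representations, which is an assumption of this sketch, not the paper's setup):

```python
# Sketch of the downstream evaluation: given transitions (s_t, a_t, s_{t+1})
# in some representation space, train a Random Forest to predict a_t from
# (s_t, s_{t+1}) and report 10-fold cross-validated accuracy while varying
# max_depth, the capacity knob of the classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
ACTIONS = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}  # right/left/up/down
N = 8

# Synthetic transitions from the cyclic grid (stand-in representation).
states = rng.integers(0, N, size=(2000, 2))
actions = rng.integers(0, 4, size=2000)
nexts = np.array([(s + ACTIONS[int(a)]) % N for s, a in zip(states, actions)])

X = np.hstack([states, nexts])  # features: (s_t, s_{t+1})
for max_depth in (2, 8):
    clf = RandomForestClassifier(n_estimators=50, max_depth=max_depth,
                                 random_state=0)
    acc = cross_val_score(clf, X, actions, cv=10).mean()
    print(f"max_depth={max_depth}: accuracy={acc:.2f}")
```

In the paper's experiments, the features come from each trained encoder instead of raw grid coordinates, and the sweep over max_depth produces the curves of Fig.3.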
We then report the 10-fold cross-validation mean accuracy as a function of the maximum depth parameter of the Random Forest, which controls the capacity of the classifier.

Figure 3: Downstream-task evaluation of the representation models: inverse-model prediction. Mean 10-fold cross-validation accuracy as a function of dataset size ((a) 1k samples, (b) 10k samples) and classifier capacity (max depth parameter of the Random Forest), for CCI-VAE (non-linear disentangled, dim=2 and dim=4), Auto-Encoder (non-disentangled, dim=2) and Forward-VAE (linear disentangled, dim=4). LSB and SB-disentangled representations perform best.

7.2 Results

We first observe that in all cases, either LSB or SB-disentangled representations perform best. In terms of final performance, all models meet at the upper 100% accuracy limit, given enough data and a classifier with enough capacity.
However, if we consider a constraint on training set size and a fixed high-capacity classifier (see Fig.3), we can see that using a SB-disentangled representation is superior to the other options. We refer to the capacity of the classifier as "high" if increasing the capacity parameter does not lead to an increase in validation accuracy.
Moreover, if we consider a fixed large training set size and a constraint on the classifier's capacity, using a LSB-disentangled representation is the best option.
In conclusion, we observed that it is easier for a small-capacity classifier to solve the task using a LSB-disentangled representation, and it is easier to solve the task using less data with a SB-disentangled representation. 
This indicates that (L)SB-disentanglement is indeed useful for downstream task solving.

7.3 Remarks

It is worth noting that the advantage is not very substantial, which is expected given the simplicity of the task. Our results on the usefulness of (L)SB-disentangled representations for downstream tasks are preliminary; future work should compare against more baselines and on more tasks. Other related works, such as van Steenkiste et al. (2019) and Locatello et al. (2019a), also study the usefulness of disentangled representations for downstream tasks, and respectively find them useful for performance in abstract visual reasoning tasks and for encouraging fairness when sensitive variables are not observed.
More generally, the disentangled representation learning literature lacks large-scale evaluations of representations' usefulness for downstream tasks. Such studies are needed to validate the intuition that disentanglement is useful in practice for subsequent tasks.

8 Discussion & Conclusion

Discussion. The benefit of using transitions rather than still observations for representation learning in the context of an agent acting in an environment has been proposed, discussed and implemented in previous work (Thomas et al., 2017; Raffin et al., 2019). In this work, however, we emphasize that using transitions is not only a beneficial option but is compulsory under the current definition of SBDRL for an agent acting in an environment, as Theorem 1 proves.
Applying SBDRL to more complex environments is not straightforward. For instance, consider adding an object to the environment studied in this paper. The group structure of the symmetries of the world is then broken when the agent is close to the object. However, the symmetries are conserved locally.
One approach would be to start from this local property to learn an approximate SB-disentangled representation.

Conclusion. Using theoretical and empirical arguments, we demonstrated that SBDRL (Higgins et al., 2018), a proposed definition for disentanglement in Representation Learning, requires interaction with the environment. We then proposed two methods to perform SBDRL in practice, both of which are empirically successful. We believe SBDRL provides a new perspective on disentanglement which can be promising for Representation Learning in the context of an agent acting in an environment.

Acknowledgements

We thank Irina Higgins for insightful mail discussions.

References

Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.

Hugo Caselles-Dupré, Louis Annabi, Oksana Hagen, Michael Garcia-Ortiz, and David Filliat. 2018. Flatland: a Lightweight First-Person 2-D Environment for Reinforcement Learning. Workshop on Continual Unsupervised Sensorimotor Learning, ICDL-EpiRob 2018 (2018).

Hugo Caselles-Dupré, Michael Garcia-Ortiz, and David Filliat. 2019. S-TRIGGER: Continual State Representation Learning via Self-Triggered Generative Replay. arXiv preprint arXiv:1902.09434 (2019).

David Ha and Jürgen Schmidhuber. 2018. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems. 2450–2462.

Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. 2018. Towards a Definition of Disentangled Representations. arXiv preprint arXiv:1812.02230 (2018).

Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. 2017. DARLA: Improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.
org, 1480–1490.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. 3rd International Conference on Learning Representations (2014).

Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem. 2019a. On the Fairness of Disentangled Representations. NeurIPS 2019 (2019).

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019b. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In International Conference on Machine Learning. 4114–4124.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

Antonin Raffin, Ashley Hill, Kalifou René Traoré, Timothée Lesort, Natalia Díaz-Rodríguez, and David Filliat. 2019. Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. ICLR 2019 Workshop on Structure & Priors in Reinforcement Learning (SPiRL) (2019).

Stefano Soatto. 2011. Steps Towards a Theory of Visual Information: Active Perception, Signal-to-Symbol Conversion and the Interplay Between Sensing and Control. Technical Report.

Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. 2017. Independently Controllable Features. arXiv preprint arXiv:1708.01289 (2017).

Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. 2019. Are Disentangled Representations Helpful for Abstract Visual Reasoning? NeurIPS 2019 (2019).