Unsupervised Emergence of Egocentric Spatial Structure from Sensorimotor Prediction

Advances in Neural Information Processing Systems, pages 7158–7168

Alban Laflaquière
AI Lab, SoftBank Robotics Europe
Paris, France
alaflaquiere@softbankrobotics.com

Michael Garcia Ortiz
AI Lab, SoftBank Robotics Europe
Paris, France
mgarciaortiz@softbankrobotics.com

Abstract

Despite its omnipresence in robotics applications, the nature of spatial knowledge and the mechanisms that underlie its emergence in autonomous agents are still poorly understood. Recent theoretical works suggest that the Euclidean structure of space induces invariants in an agent's raw sensorimotor experience. 
We hypothesize that capturing these invariants is beneficial for sensorimotor prediction and that, under certain exploratory conditions, a motor representation capturing the structure of the external space should emerge as a byproduct of learning to predict future sensory experiences. We propose a simple sensorimotor predictive scheme, apply it to different agents and types of exploration, and evaluate the pertinence of these hypotheses. We show that a naive agent can capture the topology and metric regularity of its sensor's position in an egocentric spatial frame without any a priori knowledge or extraneous supervision.

1 Introduction

Current model-free Reinforcement Learning (RL) approaches have proven to be very successful at solving difficult problems, but seem to lack the ability to extrapolate and transfer already acquired knowledge to new circumstances [7, 33]. One way to overcome this limitation would be for learning agents to abstract from the data a model of the world that could support such extrapolation. For agents acting in the world, such an acquired model should include a concept of space, such that the spatial properties of the data they collect could be disentangled and extrapolated upon.
This problem naturally raises the question of the nature of space and how this abstract concept can be acquired. This question has already been addressed philosophically by great minds of the past [18, 36, 31], among which the approach proposed by H. Poincaré is of particular interest, as it naturally lends itself to a mathematical formulation and concrete experimentation. He was interested in understanding why we perceive ourselves as being immersed in a 3D and isotropic (Euclidean) space when our actual sensory experiences live in a multidimensional space of a different nature and structure (for instance, when the environment is projected on the flat heterogeneous surface of our retina). 
He suggested that the concept of space emerges via the discovery of compensable sensory changes that are generated by a change in the environment but can be canceled out by a motor change. This compensability property applies specifically to displacements of objects in the environment and of the sensor, but not to non-spatial changes (an object changing color, the agent changing its camera aperture, etc.). For instance, one can compensate the sensory change due to an object moving 1 meter away by moving 1 meter toward the object. Moreover, this compensability property is invariant to the content of the environment, as the displacement of an object can be compensated by the same motor change regardless of the type of the object. One can thus theoretically derive, from the structure underlying these compensatory motor changes, a notion of space abstracted from the specific sensory inputs that any given environment's content induces.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

This philosophical stance has inspired recent theoretical works on the perception of space, and has in particular been coupled with the SensoriMotor Contingencies Theory (SMCT), a groundbreaking theory of perception that gives prominence to the role of motor information in the emergence of perceptive capabilities [32]. It led to theoretical results regarding the extraction of the dimension of space [23], the characterization of displacements as compensable sensory variations [38], the grounding of the concept of point of view in the motor space [24, 25], as well as the characterization of the metric structure of space via sensorimotor invariants [26]. These theoretical works suggest that an egocentric concept of space should emerge first, and that it could be grounded in the motor space as a way to economically capture the sensorimotor invariants that space induces. 
Our goal is thus to study how an unsupervised agent can build an internal representation of its sensor's egocentric spatial configuration akin to the (x, y, z) Euclidean description that would otherwise be provided by a hand-designed model. This implies capturing the topology and regular metric structure of the external space in which the sensor moves, in a way that does not depend on the content of the environment. This basic egocentric representation would be a solid foundation for the development of richer spatial knowledge and reasoning (e.g., navigation, localization).
The contribution of this work is to cast the aforementioned theoretical works in an unsupervised (self-supervised) Machine Learning frame. We further develop the formalization of space-induced sensorimotor invariants, and show that a representation capturing the topological and metric structure of space can emerge as a by-product of sensorimotor prediction. These results shed new light on the fundamental nature of spatial perception, and give some insight into the autonomous grounding of spatial knowledge in a naive agent's sensorimotor experience.

2 Related work

Only a quick overview of the large literature related to spatial representation learning is given here, leaving aside approaches where the spatial structure of the problem is largely hard-coded [6].
The problem is often conceptualized as the learning of grid or place cells, inspired by neuroscience [4]. Place cells have been built as a way to compress sensory information [2], or to improve sensorimotor and reward predictability [42, 37, 11]. Grid cells have been built as intermediary representations in recurrent networks trained to predict an agent's position [3, 8]. Both place and grid cells have also been extracted by processing the internal state of a reservoir [1]. Representations akin to place cells and displacements have also been built from low-level sensorimotor interaction [21]. 
These works rely on the extraneous definition of "spatial" inductive biases or hand-designed loss functions.
In RL, state representation learning is often used to solve spatial tasks (e.g., navigation). Some noteworthy works build states based on physical priors [16], system controllability [40], action sequencing [5], or disentanglement of controllable factors in the data [39]. Many end-to-end approaches are also applied to spatial problems without explicitly building spatial representations [29, 17], although auxiliary tasks are sometimes used to induce spatial constraints during training [28]. These works once again rely on hand-designed priors to obtain spatial-like representations.
As in this work, forward sensorimotor predictive models are learned in many methods to compress sensory inputs, improve policy optimization, or derive a curiosity-like reward [12, 9, 41, 34]. Such forward models are also at the core of body schema learning approaches [15, 22]. Closer to this work, an explicit representation of displacements is built in [10] by integrating motor sequences for sensory prediction. However, these works do not study how spatial structure can be captured in such models.
Different flavors of Variational Auto-Encoders have been used to encode "spatial" factors of variation in a latent representation [13]; a work that interestingly led to the definition of disentanglement of spatial factors as invariants in an agent's experience [14]. These works however ignore the motor component of the problem.
Finally, this work is in line with the theoretical developments of [35, 23–25, 38, 26, 27], which address the fundamental problem of space perception in the framework of the SMCT, but frames them in an unsupervised machine learning framework. 
We show that the structure of space can get naturally captured as a by-product of sensorimotor prediction.

3 Problem setup

Let us consider an agent and an environment immersed in space. The agent has a fixed base, and is equipped with a sensor that it can move to explore its environment. It only has access to raw sensorimotor experiences, and has no a priori knowledge about the world and the external space. Let m ∈ R^Nm be the static configuration of its motors¹, referred to as the motor state. Let s ∈ R^Ns be the reading of its exteroceptive sensor, referred to as the sensory state. Let ε ∈ R^Nε be the state of the environment, defining both its spatial and non-spatial properties. Finally, let p ∈ R^Np be the external position of the sensor in an egocentric frame of reference centered on the agent's base. This space of positions is assumed to be a typical Euclidean space with a regular topology and metric. Our goal is to build, from raw sensorimotor experiences (m, s), an internal representation h ∈ R^Nh which captures the topological and metric structure of p ∈ R^Np.
Inspired by Poincaré's original insight and borrowing from the formalism of [35], we assume that the agent's sensorimotor experience can be modeled as a continuous mapping parametrized by the state of the environment: s = φ_ε(m). The mapping φ represents all the constraints that the unknown structure of the world imposes on the agent's experience. In particular, it incorporates the structure of the space in which the agent and the environment are immersed. It has been shown that this structure manifests itself as invariants in the sensorimotor experience [26]. 
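As an illustration (not taken from the paper), the mapping s = φ_ε(m) can be instantiated in a toy 1-D world. All names and functional forms below are hypothetical: a non-linear motor-to-position mapping p(m), and a sensory reading that samples the environment's "texture" shifted by the environment state ε.

```python
import numpy as np

# Hypothetical toy instance of s = phi_eps(m): a 1-D agent whose motor state m
# places its sensor at position p(m), and whose sensory reading depends on both
# the sensor position and the environment state eps (a rigid shift of the
# environment's "texture"). Names and forms are illustrative assumptions.

def position(m):
    # Non-linear, content-independent motor-to-position mapping p(m).
    return np.tanh(m)

def phi(m, eps):
    # Sensory reading: environment "texture" sampled at the sensor position,
    # rigidly shifted by the environment state eps.
    p = position(m)
    return np.array([np.sin(3.0 * (p - eps)), np.cos(5.0 * (p - eps))])

# The same motor state always maps to the same egocentric position p...
assert position(0.4) == position(0.4)
# ...but its sensory reading changes with the environment state eps.
s_a = phi(0.4, eps=0.0)
s_b = phi(0.4, eps=1.0)
print(np.allclose(s_a, s_b))  # False: s depends on eps, while p does not
```

This mirrors the key asymmetry stated in the text: the motor-to-position relation is independent of the environment, while the motor-to-sensation relation is not.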
We reformulate here these invariants in a more compact way, taking advantage of the continuity of the sensorimotor mapping φ. An intuitive description of them is given below, with a more complete mathematical derivation in Appendix A.
Topological invariants: The topology (and in particular the dimensionality) of R^Np is a priori different from that of R^Nm and R^Ns. Yet, assuming no consistent sensory ambiguity between different sensor positions in the environments the agent explores, the sensory space experienced by the agent in each environment is a manifold, embedded in R^Ns, which is homeomorphic to the space R^Np. Intuitively, this means that small displacements of the sensor are associated with small sensory changes, and vice versa, for any environmental state ε. From a motor perspective, this implies that motor changes associated with small sensory changes correspond to small external displacements, as formalized in Eq. (1), where |·| denotes a norm and μ is a small value. The topology of R^Np is thus accessible via sensorimotor experiences, and constrains how different motor states get mapped to similar sensory states. In the particular case of a redundant motor system, the multiple m which lead to the same sensor position p all generate the same sensory state s for any environmental state ε. The agent thus has access to the fact that the manifold of sensory states, and thus the space of sensor positions, is of lower dimension than its motor space. 
Note that these relations are invariant to the environmental state ε.
We hypothesize that these topological invariants should be accessible to the agent under condition I: when exploring the world, the agent should experience consistent sensorimotor transitions (m_t, s_t) → (m_{t+1}, s_{t+1}) such that the state of the environment ε stays unchanged during the transition.
Metric invariants: The metric of R^Np is a priori different from the metric of R^Nm and R^Ns. Yet, the metric regularity of the external space is accessible in the sensorimotor experience if the environment also undergoes displacements [26]. Indeed, let us consider two different sensory states s_t and s_{t+1} associated with two motor states m_t and m_{t+1} when the environment is in a first position ε. The same sensory states can be re-experienced with two different motor states m_{t′} and m_{t′+1} after the environment has moved rigidly to a new position ε′. This is the compensability property of displacements coined by H. Poincaré. Thus, the consequence of the environment moving rigidly relative to the agent's base is that equivalent displacements of the sensor in the external Euclidean space, p_{t+1} − p_t = p_{t′+1} − p_{t′}, can generate the same sensory change s_t → s_{t+1} for different positions of the environment. In turn, the different motor changes m_t → m_{t+1} and m_{t′} → m_{t′+1} generating equivalent sensor displacements are associated with the same sensory changes for different positions of the environment. 
It ensues that (see Appendix A for the complete development):

∀ε:   |φ_ε(m_t) − φ_ε(m_{t+1})| ≪ μ  ⇔  |s_t − s_{t+1}| ≪ μ  ⇔  |p_t − p_{t+1}| ≪ μ,        (1)

∀ε, ε′:   |φ_ε(m_t) − φ_{ε′}(m_{t′})| ≪ μ  and  |φ_ε(m_{t+1}) − φ_{ε′}(m_{t′+1})| ≪ μ
          ⇔  |(p_{t+1} − p_t) − (p_{t′+1} − p_{t′})| ≪ μ.        (2)

These relations are once again invariant to the states ε and ε′, as long as ε → ε′ corresponds to a global rigid displacement of the environment. The metric regularity of R^Np is thus accessible via sensorimotor experiences, and constrains how different motor changes get mapped to similar sensory changes, for different positions of the environment.

¹If the body is not controlled in position, m can be a proprioceptive reading of the body configuration.

We hypothesize that these metric invariants should be accessible to the agent under condition II: the agent should experience displacements of the environment ε → ε′ in-between consistent sensorimotor transitions (m_t, s_t) → (m_{t+1}, s_{t+1}) and (m_{t′}, s_{t′}) → (m_{t′+1}, s_{t′+1}).
Space thus induces an underlying structure in the way motor states map to sensory states. Interestingly, the sensory component of this structure varies across environments, but not the motor one: a single m is always associated with the same egocentric position p, regardless of the environmental state ε, while the associated s changes with ε. 
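The compensability property behind Eq. (2) can be checked numerically in a hypothetical 1-D world where the sensory reading only depends on the sensor position relative to the environment, s = f(p − ε); the function f and all numeric values below are illustrative assumptions.

```python
import numpy as np

# Numerical sketch of the metric invariant (Eq. 2) in a toy 1-D world: since
# s = f(p - eps), a rigid shift of the environment (eps -> eps') is compensated
# by shifting the sensor positions by the same amount, re-creating the same
# pair of sensory states for an identical sensor displacement.

def sensory(p, eps):
    # Hypothetical environment content, rigidly shifted by eps.
    return np.array([np.sin(2.0 * (p - eps)), np.cos(3.0 * (p - eps))])

eps, eps2 = 0.0, 0.7                # environment before/after a rigid shift
p_t, p_t1 = 0.2, 0.5                # sensor positions in a first transition
p_u, p_u1 = p_t + 0.7, p_t1 + 0.7   # compensated positions after the shift

# The same sensory states are re-experienced at the compensated positions...
assert np.allclose(sensory(p_t, eps), sensory(p_u, eps2))
assert np.allclose(sensory(p_t1, eps), sensory(p_u1, eps2))
# ...and the two sensor displacements are identical, as in Eq. (2).
assert np.isclose((p_t1 - p_t) - (p_u1 - p_u), 0.0)
```

The sensory pairs match across the two environment positions exactly because the displacement of the sensor is the same, which is the invariant the agent is expected to capture.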
A stable representation of p can then be grounded in the motor space (as already argued in [25]), and shaped by sensory experiences which reveal this structure.
In this work, we hypothesize that capturing space-induced invariants is beneficial for sensorimotor prediction, as they can act as acquired priors over sensorimotor transitions not yet experienced. For instance, imagine two motor states m_a and m_b have always been associated with identical sensory states in the past. Then encoding them with the same representation h_a = h_b can later help the agent extrapolate that if m_a is associated with a previously unseen sensory state s_a, then m_b will be too. Therefore, we propose to train a neural network to perform sensorimotor prediction, and to analyze how it learns to encode motor states depending on the type of exploration that generates the sensorimotor data. We expect this learned representation to capture the topology of p ∈ R^Np when condition I is fulfilled, and to capture its metric regularity when condition II is fulfilled, without extraneous constraint or supervision.

4 Experiments

Sensorimotor predictive network: We propose a simple neural network architecture to perform sensorimotor prediction. The network's objective is to predict the sensory outcome s_{t+1} of a future motor state m_{t+1}, given a current motor state m_t and sensory state s_t. Additionally, we want both motor states m_t and m_{t+1} to be encoded in the same representational space. As illustrated in Fig. 1, the network is thus made of two types of modules: i) Net_enc, a Multi-Layer Perceptron (MLP) taking a motor state m_t as input, and outputting a motor representation h_t of dimension Nh, and ii) Net_pred, an MLP taking as input the concatenation of a current representation h_t, a future representation h_{t+1}, and a current sensory state s_t, and outputting a prediction for the future sensory state s̃_{t+1}. 
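The two modules can be sketched in plain numpy as below. This is not the authors' implementation: the layer sizes, the leaky-ReLU stand-in for the selu units shown in Fig. 1, and the initialization are all illustrative assumptions; only the overall wiring (a shared encoder applied to both motor states, feeding a predictor together with s_t) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of the architecture: Net_enc maps a motor state to h, and
# Net_pred maps [h_t, h_t1, s_t] to a sensory prediction. The SAME Net_enc
# weights encode both m_t and m_t1, mimicking the siamese weight sharing.
# Layer sizes and the nonlinearity are placeholders, not the paper's exact net.

def mlp(sizes):
    return [(rng.normal(0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.where(x > 0, x, 0.01 * x)  # leaky-ReLU stand-in for selu
    return x

Nm, Ns, Nh = 3, 4, 3                  # motor, sensory, representation sizes
net_enc = mlp([Nm, 16, Nh])           # shared ("siamese") encoder weights
net_pred = mlp([2 * Nh + Ns, 32, Ns])

m_t, m_t1, s_t = rng.normal(size=Nm), rng.normal(size=Nm), rng.normal(size=Ns)
h_t, h_t1 = forward(net_enc, m_t), forward(net_enc, m_t1)  # same weights twice
s_pred = forward(net_pred, np.concatenate([h_t, h_t1, s_t]))
print(s_pred.shape)  # (4,)
```

The important design choice is that gradients from the prediction loss reach the motor encoding only through the shared Net_enc, so any structure that emerges in h is a by-product of prediction, not of an explicit spatial objective.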
The overall network connects a predictive module Net_pred to two siamese copies of a Net_enc module, ensuring that both m_t and m_{t+1} are encoded the same way. The loss function to minimize is the Mean Squared Error (MSE) between the sensory prediction and the ground truth:

Loss = (1/K) Σ_{k=1}^{K} |s̃_{t+1}^{(k)} − s_{t+1}^{(k)}|²,        (3)

where K is the number of sensorimotor transitions collected by the agent. No extra component is added to the loss regarding the structure of the representation h. Unless stated otherwise, the dimension Nh is arbitrarily set to 3 for the sake of visualization. A more thorough description of the network and training procedure is available in Appendix B.

Analysis of the motor representation: We use two measures, D_topo and D_metric, to assess how much the structure of the representation h built by the network differs from that of the sensor position p. The first corresponds to an estimation of the topological dissimilarity between a set in R^Np and the corresponding set in R^Nh:

D_topo = (1/N²) Σ_{i,j}^{N} ( |h_i − h_j| / max_{k,l} |h_k − h_l| ) · exp( −α · |p_i − p_j| / max_{k,l} |p_k − p_l| ),        (4)

where |·| denotes the Euclidean norm, N is the number of samples in each set, and α is arbitrarily set to 50. This measure is large when close sensor positions p are encoded by distant motor representations h, and small otherwise.
The second measure corresponds to an estimation of the metric dissimilarity between those same two sets. At this point, it is important to notice that the metric invariants described in Sec. 3 only imply relative distance constraints (see also [26]). 
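As a sketch (not the authors' code), Eq. (4) can be computed with numpy as follows; the point sets, the random seed, and the shuffling baseline are illustrative assumptions.

```python
import numpy as np

# Sketch of the topological dissimilarity of Eq. (4), with alpha = 50 as in
# the text. D_topo penalizes pairs that are close in position space p but far
# apart in representation space h; distances are normalized by the largest
# distance in each space, as in the text.

def d_topo(h, p, alpha=50.0):
    dh = np.linalg.norm(h[:, None, :] - h[None, :, :], axis=-1)
    dp = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    dh = dh / dh.max()
    dp = dp / dp.max()
    return np.mean(dh * np.exp(-alpha * dp))  # mean over the N^2 pairs

rng = np.random.default_rng(0)
p = rng.uniform(size=(100, 2))         # hypothetical sensor positions
good = d_topo(p.copy(), p)             # h sharing p's structure exactly
bad = d_topo(rng.permutation(p), p)    # topology-destroying shuffle of h
print(good < bad)  # True: breaking the topology of h raises D_topo
```

The exponential term makes only near-neighbor pairs in position space matter, which is why the measure probes topology rather than global geometry.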
Consequently, any representation h related to p via an affine transformation would respect these constraints². In order to properly assess whether the two sets share the same metric regularity, we first perform a linear regression of p on h to cancel out the potential affine transformation between the two. We denote h^(p) = A·h + b the resulting projection of h in R^Np, where A and b are the optimal parameters of the linear regression.

²An affine transformation preserves topology and distance ratios.

Figure 1: (a) The neural network architecture featuring two siamese instances of the Net_enc module, and a Net_pred module. (b-c) Illustrations of the Discrete world and Arm in a room simulations. The fixed base of the agent is represented by a red square. The working space reachable by the sensor is framed in dotted red. The sensor and its current egocentric position p = [x, y] in this frame are displayed in blue. The environment can translate relatively to the agent's base, which is equivalent to an (opposite) displacement of the agent's base itself, as illustrated by red arrows. (Best seen in color)
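The affine cancellation step can be sketched with an ordinary least-squares fit; the data, the hidden affine map A_true, and the use of numpy's lstsq are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Sketch of the affine "cancellation": regress p on h so that h^(p) = A.h + b
# lives in position space. Here h is built as a known affine image of p, so
# the regression should recover p almost exactly.

rng = np.random.default_rng(1)
p = rng.uniform(size=(50, 2))                 # ground-truth sensor positions
A_true = np.array([[2.0, -1.0], [0.5, 3.0]])  # hypothetical hidden affine map
h = p @ A_true.T + np.array([0.3, -0.7])      # representation h, affine in p

# Solve [h, 1] @ W ~= p in the least-squares sense (W stacks A and b).
H = np.hstack([h, np.ones((len(h), 1))])
W, *_ = np.linalg.lstsq(H, p, rcond=None)
h_p = H @ W                                   # projection h^(p) of h into R^Np

print(np.allclose(h_p, p))  # True: the affine transformation is canceled out
```

Any residual mismatch between h^(p) and p after this step is therefore genuinely metric, which is exactly what the second dissimilarity measures.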
The second dissimilarity D_metric is then defined as:

D_metric = (1/N²) Σ_{i,j}^{N} | |h^(p)_i − h^(p)_j| − |p_i − p_j| | / max_{k,l} |p_k − p_l|.        (5)

This measure is large when the distance between two sensor positions p differs from the distance between the two corresponding motor representations h (after affine projection), and is small otherwise. It is equal to zero when there exists a perfect affine mapping between h and p, in which case the two sets have equivalent metric regularities.
Note that in (4) and (5), distances are normalized by the largest distance in the corresponding space in order to avoid undesired scaling effects. In the following, the dissimilarities are computed on sets of p and corresponding h generated by sampling the motor space in a fixed and regular fashion (see Fig. 3). This ensures a rigorous comparison of their values between epochs and between runs.

Types of exploration: Three types of exploration of the environment are considered to test the hypotheses laid out in Sec. 3. They correspond to different ways to generate the sensorimotor transitions (m_t, s_t) → (m_{t+1}, s_{t+1}) fed to the network during training:
Inconsistent transitions in a moving environment: The motor space is randomly sampled, and the environment randomly moves between t and t + 1. Both conditions I and II are broken, as the agent explores a constantly moving environment, such that its sensorimotor transitions have no spatiotemporal consistency³. We refer to this type of exploration as MEM (Motor-Environment-Motor), in agreement with the order of changes for each transition.
Consistent transitions in a static environment: The motor space is sampled randomly, and the environment stays static. 
Condition I is fulfilled, as the agent experiences spatiotemporally consistent transitions in a static environment, but not condition II, as the environment does not move between transitions. We refer to this type of exploration as MM (Motor-Motor).
Consistent transitions in a moving environment: The motor space is randomly sampled, and the environment randomly moves after each transition (after t + 1). Both conditions I and II are fulfilled, as the agent experiences spatiotemporally consistent transitions, and the environment moves between transitions. We refer to this type of exploration as MME (Motor-Motor-Environment).
Additional details on the sampling procedure are available in Appendix B. According to Sec. 3, we expect the sensorimotor data to contain no spatial invariants in the MEM case, topological invariants in the MM case, and both topological and metric invariants in the MME case.

³It is akin to the kind of data a passive and non-situated agent receives in typical machine learning settings.

[Figure 1 graphic: panels (a) neural network architecture, with MLP layers annotated 200/150/100 selu and twice 150/100/50 selu; (b) Discrete world; (c) Arm in a room]

Figure 2: Evolution of the loss and the dissimilarity measures D_topo and D_metric during training for both setups, and for the three types of exploration. The displayed means and standard deviations are computed over 50 independent runs. (Best seen in color)

Agent-Environment setups: We simulate two different agent-environment setups: Discrete world and Arm in a room. The first one is a minimalist artificial setup designed to test our hypotheses in optimal conditions. It corresponds to a 2D grid-world environment that the agent can explore by generating 3D motor states m which map, in a non-linear and redundant way, to 2D positions p of the sensor in the grid (see Fig. 1(b)). 
For each position the agent receives a 4D sensory input s that is designed to vary smoothly over the grid and to have no sensory ambiguity between different positions. The whole grid can translate with respect to the agent's base, changing its state ε, and acts as a torus to avoid any border effect.
The second one is a more complex and realistic setup in which a three-segment arm equipped with a camera explores a 3D room filled with random objects (see Fig. 1(c)). The agent can change its camera position p in a horizontal plane (with a fixed orientation) by generating 3D motor states m, and receives sensory inputs s of size 768 (16 × 16 RGB pixels). The whole room can translate in 2D with respect to the agent's base, changing its state ε, and the arm cannot move outside of the room. A more complete description of the simulations is available in Appendix C.

5 Results

We evaluate the three types of exploration on the two experimental setups. Each simulation is run 50 times, with all random parameters drawn independently on each trial. During training, the measures D_topo and D_metric are evaluated on a fixed regular sampling of the motor space. Their evolution, as well as the evolution of the loss, are displayed in Fig. 2. Additionally, Fig. 3 shows the final representations h of the regular motor sampling, for one randomly selected trial of each simulation. The corresponding positions p and the projection h^(p) are displayed in the same space in order to visualize how much their metric structures differ.

Discrete world results: The results clearly show an impact of the type of exploration on the motor encoding built by the network. First of all, as expected, the loss is high in the MEM case because the constant movements of the environment prevent any accurate sensorimotor prediction. 
On the contrary, it is low in the MM and MME cases, as the consistency of transitions enables accurate sensorimotor prediction (upper-bounded by the expressive power of the Net_pred module).
More interestingly, the topological dissimilarity D_topo has a significantly smaller value and variance in the MM and MME cases than in the MEM case. This seems to indicate that the MEM exploration leads to arbitrary representations, while the topologies of h and p are more similar when the agent can experience consistent sensorimotor transitions (m_t, s_t) → (m_{t+1}, s_{t+1}) during which the environment does not move.
Similarly, the metric dissimilarity D_metric is high in the MEM case, average in the MM case, and low in the MME case. This intermediate value in the MM case is due to the fact that capturing the topology of p also indirectly reduces the metric dissimilarity. However, the very low value in the MME case seems to indicate that the metric of h displays a regularity which is similar to that of p. This regularity is thus captured only when the agent can experience movements of the environment

Figure 3: Visualization of the normalized regular motor sampling m (blue dots), its representation h in the representational space (red dots), and the corresponding ground truth position p (blue circles) for the three types of exploration, and for both the Discrete world (a) and Arm in a room (b) setups. The linear regressions h^(p) of the representations are also displayed in the space of positions. Lines have been added to visualize the distances between each h^(p) and its ground truth counterpart p. In (a), the regular (but non-linear) 3D motor sampling generates a regular 2D grid of positions, where each position corresponds to 5 redundant motor states. In (b), the regular 3D motor sampling generates a star-shaped set of positions, where some positions, like the inner corners of the star, correspond to multiple motor states. 
(Best seen in color)

between consistent sensorimotor transitions.
This analysis is confirmed in Fig. 3, where h and its affine projection h^(p) in R^Np display an arbitrary structure in the MEM case, a structure topologically equivalent to that of p in the MM case, and a structure both topologically and metrically equivalent to that of p in the MME case.

Arm in a room results: The results in this more complex and realistic setup are globally similar to the previous ones, which shows a consistency in the phenomena we just described. However, two important differences can be pointed out.
The first is that, in the MEM case, D_topo and D_metric seem to decrease more than what would be expected for an arbitrarily random representation h. We argue that this phenomenon is due to a "border effect" induced by the walls of the room. Indeed, as the sensor's and environment's displacements are limited, each motor state m is statistically associated with a different distribution of sensory states over the whole course of the exploration. For instance, a motor state corresponding to the arm extended to the right will never experience the sensory states corresponding to the sensor being at the far left of the room, and inversely. As the sensory distribution associated with m varies smoothly with p, the agent can indirectly infer the topology of R^Np from it. We can indeed see in Fig. 3 that h (or even better h^(p)) tends to capture the topology of p in the MEM case, although with less accuracy than in the MM and MME cases. 
Note that this was not the case in the previous torus-like grid world, as any motor state could be associated with any square of the grid over the course of the exploration.
The second difference is that D_topo is higher in the MM case than in the previous setup. After an empirical visualization of the environments and associated learned representations h, we argue that this is due to potential sensory ambiguity. Indeed, the random room the agent explores can present ambiguities, such that very different positions of the sensor can be associated with very similar sensory inputs. When this happens, the representation h built by the network can arbitrarily encode the same way different motor states associated with these different positions. This leads to a representation manifold that is non-trivially twisted in R^Nh (see Fig. 3(b)), and degrades the measure D_topo. Note that this sensory ambiguity is no longer an issue in the MME case, as the movements of the environment help disambiguate these different motor states.
The same experiments have been run with a representational space of dimension Nh = 25 and with more complex agent morphologies, and led to qualitatively similar results. This seems to indicate that the capture of the topological and metric invariants is insensitive to the dimension of h and the complexity of the forward mapping. A more detailed analysis of all these different simulation results is available in Appendices D and E.

6 Conclusion

We addressed the problem of the unsupervised grounding of the concept of space in a naive agent's sensorimotor experience. Inspired by previous philosophical and theoretical work, we argue that such a notion should first emerge as a basic representation of the egocentric position of a sensor moving in space. We showed that the structure of the Euclidean space, in which the agent is immersed alongside its environment, induces sensorimotor invariants. 
They constrain the way motor states get mapped to sensory inputs, independently of the actual content of the environment, and carry information about the topology and metric of the external space. This structure can potentially be extracted from the sensorimotor flow to build an internal representation of the sensor's egocentric position, grounded in the motor space and abstracted from the content of the environment and the specific sensory states it induces. We hypothesized that capturing space-induced invariants is beneficial for sensorimotor prediction. As a consequence, topological and metric invariants should naturally be captured by a network learning to perform sensorimotor prediction. We proposed such a network architecture, and designed different types of exploration of the environment such that topological and metric invariants were present or absent in the resulting sensorimotor transitions fed to the network. We tested two different simulated agent-environment setups, and showed that when spatial invariants are present in the sensorimotor data, they are naturally captured in the internal motor representation built by the agent. Thus, when the agent can experience consistent sensorimotor transitions during which the environment does not change, the internal motor representation captures the topology of the external space in which its sensor is moving. Even more interestingly, when the agent can also experience displacements of the environment between its consistent sensorimotor transitions, the internal motor representation captures the metric regularity of this external space.
These results thus suggest that the concept of an external Euclidean space, although still in its most basic form here, could emerge in a situated agent as a by-product of learning to predict its sensorimotor experience.
We hope this work can be a stepping stone for further extensions of the approach and the unsupervised acquisition of richer spatial knowledge. A first obvious step will be to extend the exploration to 3 translations and 3 rotations of the sensor in space. A second very important step will be to derive, from the current basic egocentric spatial representation, an allocentric representation in which the spatial configurations of external objects could also be characterized; a problem for which H. Poincaré also had some interesting intuitions.

References

[1] Eric Antonelo and Benjamin Schrauwen. Towards autonomous self-localization of small mobile robots using reservoir computing and slow feature analysis. In 2009 IEEE International Conference on Systems, Man and Cybernetics, pages 3818-3823. IEEE, 2009.

[2] Angelo Arleo, Fabrizio Smeraldi, Stéphane Hug, and Wulfram Gerstner. Place cells and spatial navigation based on 2d visual feature extraction, path integration, and reinforcement learning. In Advances in Neural Information Processing Systems, pages 89-95, 2001.

[3] Andrea Banino, Caswell Barry, Benigno Uria, Charles Blundell, Timothy Lillicrap, Piotr Mirowski, Alexander Pritzel, Martin J Chadwick, Thomas Degris, Joseph Modayil, et al. Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429, 2018.

[4] Phillip J Best, Aaron M White, and Ali Minai. Spatial processing in the brain: the activity of hippocampal place cells. Annual Review of Neuroscience, 24(1):459-486, 2001.

[5] Michael Bowling, Ali Ghodsi, and Dana Wilkinson. Action respecting embedding.
In Proceedings of the 22nd International Conference on Machine Learning, pages 65-72. ACM, 2005.

[6] Cesar Cadena, Luca Carlone, Henry Carrillo, Yasir Latif, Davide Scaramuzza, José Neira, Ian Reid, and John J Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics, 32(6):1309-1332, 2016.

[7] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.

[8] Christopher J Cueva and Xue-Xin Wei. Emergence of grid-like representations by training recurrent neural networks to perform spatial localization. arXiv preprint arXiv:1803.07770, 2018.

[9] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204-1210, 2018.

[10] Michael Garcia Ortiz and Alban Laflaquière. Learning representations of spatial displacement through sensorimotor prediction. arXiv preprint arXiv:1805.06250, 2018.

[11] Nicholas J Gustafson and Nathaniel D Daw. Grid cells, place cells, and geodesic generalization for spatial reinforcement learning. PLoS Computational Biology, 7(10):e1002235, 2011.

[12] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

[14] Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations.
arXiv preprint arXiv:1812.02230, 2018.

[15] Matej Hoffmann, Hugo Marques, Alejandro Arieta, Hidenobu Sumioka, Max Lungarella, and Rolf Pfeifer. Body schema in robotics: a review. IEEE Transactions on Autonomous Mental Development, 2(4):304-324, 2010.

[16] Rico Jonschkowski and Oliver Brock. Learning state representations with robotic priors. Autonomous Robots, 39(3):407-428, 2015.

[17] Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. arXiv preprint arXiv:1709.10489, 2017.

[18] Immanuel Kant. Kritik der reinen Vernunft. PUF, 1781.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971-980, 2017.

[21] Benjamin Kuipers. The spatial semantic hierarchy. Artificial Intelligence, 119(1-2):191-233, 2000.

[22] Robert Kwiatkowski and Hod Lipson. Task-agnostic self-modeling machines. Science Robotics, 4:eaau9354, 2019.

[23] Alban Laflaquiere, Sylvain Argentieri, Olivia Breysse, Stéphane Genet, and Bruno Gas. A non-linear approach to space dimension perception by a naive agent. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 3253-3259. IEEE, 2012.

[24] Alban Laflaquiere, Alexander V Terekhov, Bruno Gas, and J Kevin O'Regan. Learning an internal representation of the end-effector configuration space. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 1230-1235.
IEEE, 2013.

[25] Alban Laflaquière, J Kevin O'Regan, Sylvain Argentieri, Bruno Gas, and Alexander V Terekhov. Learning agent's spatial configuration from sensorimotor invariants. Robotics and Autonomous Systems, 71:49-59, 2015.

[26] Alban Laflaquière, J Kevin O'Regan, Bruno Gas, and Alexander Terekhov. Discovering space: grounding spatial topology and metric regularity in a naive agent's sensorimotor experience. Neural Networks, 2018.

[27] Valentin Marcel, Sylvain Argentieri, and Bruno Gas. Where do I move my sensors? Emergence of an internal representation from the sensorimotor flow.

[28] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016.

[29] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[30] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.

[31] Jean Nicod. La géométrie dans le monde sensible. Presses universitaires de France, 1924.

[32] J Kevin O'Regan and Alva Noë. A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24(5):939-973, 2001.

[33] Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.

[34] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell.
Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

[35] David Philipona, J Kevin O'Regan, and J-P Nadal. Is there something out there? Inferring space from sensorimotor dependencies. Neural Computation, 15(9):2029-2049, 2003.

[36] Henri Poincaré. L'espace et la géométrie. Revue de métaphysique et de morale, 3(6):631-646, 1895.

[37] Kimberly L Stachenfeld, Matthew M Botvinick, and Samuel J Gershman. The hippocampus as a predictive map. Nature Neuroscience, 20(11):1643, 2017.

[38] Alexander V Terekhov and J Kevin O'Regan. Space as an invention of active agents. Frontiers in Robotics and AI, 3:4, 2016.

[39] Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.

[40] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746-2754, 2015.

[41] Greg Wayne, Chia-Chun Hung, David Amos, Mehdi Mirza, Arun Ahuja, Agnieszka Grabska-Barwinska, Jack Rae, Piotr Mirowski, Joel Z Leibo, Adam Santoro, et al. Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.

[42] Daniel Weiller, Robert Märtin, Sven Dähne, Andreas K Engel, and Peter König. Involving motor capabilities in the formation of sensory space representations. PloS One, 5(4):e10377, 2010.