{"title": "Principles of Riemannian Geometry  in Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2807, "page_last": 2816, "abstract": "This study deals with neural networks in the sense of geometric transformations acting on the coordinate representation of the underlying data manifold which the data is sampled from. It forms part of an attempt to construct a formalized general theory of neural networks in the setting of Riemannian geometry. From this perspective, the following theoretical results are developed and proven for feedforward networks. First it is shown that residual neural networks are finite difference approximations to dynamical systems of first order differential equations, as opposed to ordinary networks that are static. This implies that the network is learning systems of differential equations governing the coordinate transformations that represent the data. Second it is shown that a closed form solution of the metric tensor on the underlying data manifold can be found by backpropagating the coordinate representations learned by the neural network itself. This is formulated in a formal abstract sense as a sequence of Lie group actions on the metric fibre space in the principal and associated bundles on the data manifold. Toy experiments were run to confirm parts of the proposed theory, as well as to provide intuitions as to how neural networks operate on data.", "full_text": "Principles of Riemannian Geometry\n\nin Neural Networks\n\nMichael Hauser\n\nDepartment of Mechanical Engineering\n\nPennsylvania State University\n\nState College, PA 16801\nmzh190@psu.edu\n\nAsok Ray\n\nDepartment of Mechanical Engineering\n\nPennsylvania State University\n\nState College, PA 16801\n\naxr2@psu.edu\n\nAbstract\n\nThis study deals with neural networks in the sense of geometric transformations\nacting on the coordinate representation of the underlying data manifold which\nthe data is sampled from. 
It forms part of an attempt to construct a formalized\ngeneral theory of neural networks in the setting of Riemannian geometry. From\nthis perspective, the following theoretical results are developed and proven for\nfeedforward networks. First it is shown that residual neural networks are finite\ndifference approximations to dynamical systems of first order differential\nequations, as opposed to ordinary networks that are static. This implies that the\nnetwork is learning systems of differential equations governing the coordinate\ntransformations that represent the data. Second it is shown that a closed form\nsolution of the metric tensor on the underlying data manifold can be found by\nbackpropagating the coordinate representations learned by the neural network\nitself. This is formulated in a formal abstract sense as a sequence of Lie group\nactions on the metric fibre space in the principal and associated bundles on the\ndata manifold. Toy experiments were run to confirm parts of the proposed theory,\nas well as to provide intuitions as to how neural networks operate on data.\n\n1 Introduction\n\nThe introduction is divided into two parts. Section 1.1 attempts to succinctly describe ways in\nwhich neural networks are usually understood to operate. Section 1.2 articulates a minority\nperspective. It is this minority perspective that this study develops, showing that there exists a rich\nconnection between neural networks and Riemannian geometry.\n\n1.1 Latent variable perspectives\nNeural networks are usually understood from a latent variable perspective, in the sense that\nsuccessive layers are learning successive representations of the data. For example, convolutional\nnetworks [10] are understood quite well as learning hierarchical representations of images [19].\nLong short-term memory networks [9] are designed such that input data act on a memory cell to\navoid problems with long-term dependencies. 
More complex devices like neural Turing machines\nare designed with similar intuitions for reading and writing to a memory [6].\nResidual networks were designed [7] with the intuition that it is easier to learn perturbations from\nthe identity map than it is to learn an unreferenced map. Further experiments then suggest that\nresidual networks work well because, during forward propagation and back propagation, the signal\nfrom any block can be mapped to any other block [8]. After unraveling the residual network, this\nattribute can be seen more clearly. From this perspective, the residual network can be understood\nas an ensemble of shallower networks [17].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Coordinate systems x(l+1) := \u03d5(l) \u25e6 ... \u25e6 \u03d5(1) \u25e6 \u03d5(0) \u25e6 x(0) induced by the coordinate\ntransformations \u03d5(l) : x(l)(M) \u2192 (\u03d5(l) \u25e6 x(l))(M) learned by the neural network. The pullback\nmetric g_{x(l)(M)}(X, Y) := g_{(\u03d5(l)\u25e6x(l))(M)}(\u03d5(l)\u2217 X, \u03d5(l)\u2217 Y) pulls back (i.e. backpropagates) the coordinate\nrepresentation of the metric tensor from layer l + 1 to layer l, via the pushforward map\n\u03d5(l)\u2217 : T x(l)(M) \u2192 T(\u03d5(l) \u25e6 x(l))(M) between tangent spaces.\n\n1.2 Geometric perspectives\n\nThese latent variable perspectives are a powerful tool for understanding and designing neural\nnetworks. 
However, they often overlook the fundamental process taking place, where successive\nlayers successively warp the coordinate representation of the data manifold with nonlinear\ntransformations into a form where the classes in the data manifold are linearly separable by hyperplanes.\nThese nested compositions of affine transformations followed by nonlinear activations can be seen\nin work done by C. Olah (http://colah.github.io/) and published by LeCun et al. [11].\nResearch in language modeling has shown that the word embeddings learned by the network\npreserve vector offsets [13], with an example given as x_apples \u2212 x_apple \u2248 x_cars \u2212 x_car for the\nword embedding vector x_i. This suggests the network is learning a word embedding space with\nsome resemblance to group closure, with group operation vector addition. Note that closure is\ngenerally not a property of data, for if instead of word embeddings one had images of apples and\ncars, preservation of these vector offsets would certainly not hold at the input [3]. This is because\nthe input images are represented in Cartesian coordinates, but are not sampled from a flat data\nmanifold, and so one should not measure vector offsets by Euclidean distance. In Locally Linear\nEmbedding [14], a coordinate system is learned in which Euclidean distance can be used. This work\nshows that neural networks are also learning a coordinate system in which the data manifold can\nbe measured by Euclidean distance, and the coordinate representation of the metric tensor can be\nbackpropagated through to the input so that distance can be measured in the input coordinates.\n\n2 Mathematical notations\n\nEinstein notation is used throughout this paper. A raised index in parentheses, such as x(l), means\nit is the lth coordinate system while \u03d5(l) means it is the lth coordinate transformation. 
If the index\nis not in parentheses, a superscript free index denotes the components of a vector, a subscript free\nindex denotes the components of a covector, and a repeated index means implied summation. The dots\nin tensors, such as A^{a.}_{.b}, are placeholders to keep track of which index comes first, second, etc.\nA (topological) manifold M of dimension dim M is a Hausdorff, paracompact topological space that\nis locally homeomorphic to R^{dim M} [18]. This homeomorphism x : U \u2192 x(U) \u2286 R^{dim M} is called\na coordinate system on U \u2286 M. Non-Euclidean manifolds, such as S^1, can be created by taking\nan image and rotating it in a circle. A feedforward network learns coordinate transformations\n\u03d5(l) : x(l)(M) \u2192 (\u03d5(l) \u25e6 x(l))(M), where the new coordinates x(l+1) := \u03d5(l)(x(l)) : M \u2192\nx(l+1)(M), and is initialized in Cartesian coordinates x(0) : M \u2192 x(0)(M), as seen in Figure 1.\nA data point q \u2208 M can only be represented as numbers with respect to some coordinate system;\nwith the coordinates at layer l + 1, q is represented as the layerwise composition x(l+1)(q) :=\n(\u03d5(l) \u25e6 ... \u25e6 \u03d5(1) \u25e6 \u03d5(0) \u25e6 x(0))(q). The output coordinate representation is x(L)(M) \u2286 R^d.\nFor an activation function f, such as ReLU or tanh, a standard feedforward network transforms\ncoordinates as x(l+1) := \u03d5(l)(x(l)) := f(x(l); l). Note ReLU is not a bijection and thus not a proper\ncoordinate transformation. A residual network transforms coordinates as x(l+1) := \u03d5(l)(x(l)) :=\nx(l) + f(x(l); l). Note that these are global coordinates over the entire manifold. 
A residual network\nwith ReLU activation is bijective, and is piecewise linear with kinks of infinite curvature.\n\nWith the Softmax coordinate transformation defined as\n\n$$\\mathrm{softmax}\\left(W^{(L)} \\cdot x^{(L)}\\right)^{j} := \\frac{e^{W^{(L)j} x^{(L)}}}{\\sum_{k=1}^{K} e^{W^{(L)k} x^{(L)}}}$$\n\nthe probability of q \u2208 M being from class j is P(Y = j|X = q) = softmax(W(L) \u00b7 x(L)(q))^j.\n\n3 Neural networks as Ck differentiable coordinate transformations\nOne can define entire classes of coordinate transformations. The following formulation also has the\nform of differentiable curves/trajectories, but because the number of dimensions often changes as\none moves through the network, it is difficult to interpret a trajectory traveling through a space of\nchanging dimensions. A standard feedforward neural network is a C0 function:\n\n$$x^{(l+1)} := f(x^{(l)}; l) \\tag{1}$$\n\nA residual network has the form x(l+1) = x(l) + f(x(l); l). However, because the limit L \u2192 \u221e\nwith l \u2208 [0, 1] \u2282 R is eventually taken, as opposed to l being only a finitely countable index, the\nequivalent form of the residual network is as follows:\n\n$$x^{(l+1)} \\simeq x^{(l)} + f(x^{(l)}; l)\\,\\Delta l \\tag{2}$$\n\nwhere \u2206l = 1/L for a uniform partition of the interval [0, 1] and is implicit in the weight matrix.\nOne can define entire classes of coordinate transformations inspired by finite difference\napproximations of differential equations. These can be used to impose kth order differentiable smoothness:\n\n$$\\delta x^{(l)} := x^{(l+1)} - x^{(l)} \\simeq f(x^{(l)}; l)\\,\\Delta l \\tag{3}$$\n\n$$\\delta^2 x^{(l)} := x^{(l+1)} - 2x^{(l)} + x^{(l-1)} \\simeq f(x^{(l)}; l)\\,\\Delta l^2 \\tag{4}$$\n\nEach of these defines a differential equation, but of different orders of smoothness on the coordinate\ntransformations. 
Written in this form the residual network in Equation 3 is a first-order forward\ndifference approximation to a C1 coordinate transformation and has O(\u2206l) error. 
Network architectures\nwith higher order accuracies can be constructed, such as central differencing approximations\nof a C1 coordinate transformation to give O(\u2206l^2) error.\n\nNote that the architecture of a standard feedforward neural network is a static equation, while the\nothers are dynamic. Also note that Equation 4 can be rewritten x(l+1) = x(l) + f(x(l); l)\u2206l^2 +\n\u03b4x(l\u22121), where \u03b4x(l\u22121) = x(l) \u2212 x(l\u22121), and in this form one sees that this is a residual network\nwith an extra term \u03b4x(l\u22121) acting as a sort of momentum term on the coordinate transformations.\nThis momentum term is explored in Section 7.1.\nBy the definitions of the Ck networks given by Equations 3-4, the right hand side is both continuous\nand independent of \u2206l (after dividing), and so the limit exists as \u2206l \u2192 0. Convergence rates and\nerror bounds of finite difference approximations can be applied to these equations. By the standard\ndefinition of the derivative, the residual network defines a system of differentiable transformations.\n\n$$\\frac{dx^{(l)}}{dl} := \\lim_{\\Delta l \\to 0} \\frac{x^{(l+\\Delta l)} - x^{(l)}}{\\Delta l} = f(x^{(l)}; l) \\tag{5}$$\n\n$$\\frac{d^2 x^{(l)}}{dl^2} := \\lim_{\\Delta l \\to 0} \\frac{x^{(l+\\Delta l)} - 2x^{(l)} + x^{(l-\\Delta l)}}{\\Delta l^2} = f(x^{(l)}; l) \\tag{6}$$\n\nNotations are slightly changed, by taking l = n\u2206l for n \u2208 {0, 1, 2, ..., L \u2212 1} and indexing the\nlayers by the fractional index l instead of the integer index n. This defines a partitioning:\n\n$$P = \\{0 = l(0) < l(1) < l(2) < \\dots < l(n) < \\dots < l(L) = 1\\} \\tag{7}$$\n\nwhere \u2206l(n) := l(n + 1) \u2212 l(n) can in general vary with n as long as max_n \u2206l(n) still goes to zero as\nL \u2192 \u221e. 
To reduce notation, this paper will write \u2206l := \u2206l(n) for all n \u2208 {0, 1, 2, ..., L \u2212 1}.\nIn [4], a deep residual convolution network was trained on ImageNet in the usual fashion except\nparameter weights between residual blocks at the same dimension were shared, at a cost to the\naccuracy of only 0.2%. This is the difference between learning an inhomogeneous first order\nequation dx(l)/dl := f(x(l); l) and a (piecewise) homogeneous first order equation dx(l)/dl := f(x(l)).\n\n(a) A C0 network with sharply changing layer-wise particle trajectories.\n\n(b) A C1 network with smooth layer-wise particle trajectories.\n\n(c) A C2 network also exhibits smooth layer-wise particle trajectories.\n\n(d) A combination C0 and C1 network, where the identity connection is left out in layer 6.\n\nFigure 2: Untangling the same spiral with 2-dimensional neural networks with different constraints\non smoothness. The x and y axes are the two nodes of the neural network at a given layer l, where\nlayer 0 is the input data. The C0 network is a standard network, while the C1 network is a residual\nnetwork and the C2 network also exhibits smooth layerwise transformations. All networks achieve\n0.0% error rates. The momentum term in the C2 network allows the red and blue sets to pass over\neach other in layers 3, 4 and 5. Figure 2d has the identity connection for all layers other than layer 6.\n\n4 The Riemannian metric tensor learned by neural networks\n\nFrom the perspective of differential geometry, as one moves through the layers of the neural\nnetwork, the data manifold stays the same but the coordinate representation of the data manifold\nchanges with each successive affine transformation and nonlinear activation. The objective of the\nneural network is to find a coordinate representation of the data manifold such that the classes are\nlinearly separable by hyperplanes.\nDefinition 4.1. 
(Riemannian manifold [18]) A Riemannian manifold (M, g) is a real smooth\nmanifold M with an inner product, defined by the positive definite metric tensor g, varying\nsmoothly on the tangent space of M.\n\nIf the network has been well trained as a classifier, then by Euclidean distance two input points\nof the same class may be far apart when represented by the input coordinates but close together\nin the output coordinates. Similarly, two points of different classes may be near each other when\nrepresented by the input coordinates but far apart in the output coordinates. These ideas form the\nbasis of Locally Linear Embeddings [14]. The intuitive way to measure distances is in the output\ncoordinates, which even in the unsupervised case tends to be a flattened representation of the data\nmanifold [3]. Accordingly, the metric in the output coordinates is the Euclidean metric:\n\n$$g\\left(x^{(L)}\\right)_{a_L b_L} := \\eta_{a_L b_L} \\tag{8}$$\n\nThe elements of the metric tensor transform as a tensor with coordinate transformations:\n\n$$g\\left(x^{(l)}\\right)_{a_l b_l} = \\left(\\frac{\\partial x^{(l+1)}}{\\partial x^{(l)}}\\right)^{a_{l+1}\\,\\cdot}_{\\cdot\\,a_l} \\left(\\frac{\\partial x^{(l+1)}}{\\partial x^{(l)}}\\right)^{b_{l+1}\\,\\cdot}_{\\cdot\\,b_l} g\\left(x^{(l+1)}\\right)_{a_{l+1} b_{l+1}} \\tag{9}$$\n\nFigure 3: A C1 (residual) network with a hyperbolic tangent activation function separating the\nspiral manifold. Additionally, balls of constant radius $ds = \\sqrt{g_{a_l b_l}(x^{(l)})\\, dx^{(l)a_l} dx^{(l)b_l}}$ at different\npoints are shown. In the output coordinates, distances are measured by the standard Euclidean\nmetric in Equation 8, and so the circles are \"round\". 
The coordinate representation of the metric\ntensor is pulled back (backpropagated) through the network to the input by Equations 9 and 10.\nDistances on the data manifold can then be measured with the input Cartesian coordinates, and so\nthe circles are not round. These balls can also be interpreted as forming an \u03b5 \u2212 \u03b4 relationship across\nlayers of the network, where an \u03b5-ball at one layer corresponds to a \u03b4-ball at the previous layer.\n\nThe above recursive formula is solved from the output layer to the input, i.e. the coordinate\nrepresentation of the metric tensor is backpropagated through the network from output to input:\n\n$$g\\left(x^{(l)}\\right)_{a_l b_l} = \\prod_{l'=L-1}^{l} \\left[ \\left(\\frac{\\partial x^{(l'+1)}}{\\partial x^{(l')}}\\right)^{a_{l'+1}\\,\\cdot}_{\\cdot\\,a_{l'}} \\left(\\frac{\\partial x^{(l'+1)}}{\\partial x^{(l')}}\\right)^{b_{l'+1}\\,\\cdot}_{\\cdot\\,b_{l'}} \\right] \\eta_{a_L b_L} \\tag{10}$$\n\nIf the network is taken to be residual as in Equation 2, then the Jacobian of the coordinate\ntransformation is found, with $\\delta^{a_{l+1}\\,\\cdot}_{\\cdot\\,a_l}$ the Kronecker delta:\n\n$$\\left(\\frac{\\partial x^{(l+1)}}{\\partial x^{(l)}}\\right)^{a_{l+1}\\,\\cdot}_{\\cdot\\,a_l} = \\delta^{a_{l+1}\\,\\cdot}_{\\cdot\\,a_l} + \\left(\\frac{\\partial f(x^{(l)}; l)}{\\partial x^{(l)}}\\right)^{a_{l+1}\\,\\cdot}_{\\cdot\\,a_l} \\Delta l \\tag{11}$$\n\nBackpropagating the coordinate representation of the metric tensor requires the sequence of matrix\nproducts from output to input, and can be defined for any layer l:\n\n$$P^{a_L\\,\\cdot}_{\\cdot\\,a_l} := \\prod_{l'=l}^{L-1} \\left[ \\delta^{a_{l'+1}\\,\\cdot}_{\\cdot\\,a_{l'}} + \\left(\\frac{\\partial f(z^{(l'+1)}; l')}{\\partial z^{(l'+1)}}\\right)^{a_{l'+1}\\,\\cdot}_{\\cdot\\,e_{l'+1}} \\left(\\frac{\\partial z^{(l'+1)}}{\\partial x^{(l')}}\\right)^{e_{l'+1}\\,\\cdot}_{\\cdot\\,a_{l'}} \\Delta l \\right] \\tag{12}$$\n\nwhere z(l+1) := W(l) \u00b7 x(l) + b(l). 
With this, taking the output metric to be the standard Euclidean\nmetric \u03b7_ab, the line element can be represented in the coordinate space for any layer l:\n\n$$ds^2 = \\eta_{ab}\\, P^{a\\,\\cdot}_{\\cdot\\,a_l} P^{b\\,\\cdot}_{\\cdot\\,b_l}\\, dx^{a_l} dx^{b_l} \\tag{13}$$\n\nThe data manifold is independent of coordinate representation. At the output, where distances\nare measured by the standard Euclidean metric, an \u03b5-ball can be defined. The line element in\nEquation 13 defines the corresponding \u03b4-ball at layer l. This can be used to see what in the input\nspace the neural network encodes as similar in the output space.\nAs L \u2192 \u221e, Equation 12 becomes an infinite product of matrices (from our infinite applications of\nthe chain rule) and these transformations act smoothly along the fibres of the tensor bundle. The\nproof that this sequence converges in the limit can be found in the appendix.\nThis analysis has so far assumed a constant layerwise dimension, which is not how most neural\nnetworks are used in practice, where the number of nodes often changes. This is handled by the\npullback metric [18]. Manifolds can be submersed and immersed into lower and higher dimensional\nspaces so long as the rank of the pushforward Jacobian matrix is constant for every p \u2208 M [12]. The\ndimension of the underlying data manifold is defined as the dimension of the smallest, bottleneck\nlayer of the neural network, i.e. dim M := min_l dim x(l)(M), and all other higher dimensional\nlayers are immersion/embedding representations of this lowest dimensional representation.\n\nDefinition 4.2. (Pushforward map) Let M and N be topological manifolds, \u03d5(l) : M \u2192 N a\nsmooth map and TM and TN be their respective tangent spaces. Also let X \u2208 TM where\nX : C\u221e(M) \u2192 R, and f \u2208 C\u221e(N). 
The pushforward is the linear map \u03d5(l)\u2217 : TM \u2192 TN that\ntakes an element X \u21a6 \u03d5(l)\u2217 X and is defined by its action on f as (\u03d5(l)\u2217 X)(f) := X(f \u25e6 \u03d5(l)).\n\nDefinition 4.3. (Pullback metric) Let (M, g_M) and (N, g_N) be Riemannian manifolds, \u03d5(l) : M \u2192\nN a smooth map and \u03d5(l)\u2217 : TM \u2192 TN the pushforward between their tangent spaces TM and\nTN. Then the pullback metric on M is given by g_M(X, Y) := g_N(\u03d5(l)\u2217 X, \u03d5(l)\u2217 Y) \u2200X, Y \u2208 TM.\nIn practice being able to change dimensions in the neural network is important for many reasons.\nOne reason is that neural networks usually have access to a limited number of types of nonlinear\ncoordinate transformations, for example tanh, \u03c3 and ReLU. This severely limits the ability of the\nnetwork to separate the wide variety of manifolds that exist. For example, the networks have\ndifficulty linearly separating the simple toy spirals in Figure 2 because they only have access to\ncoordinate transformations of the form tanh. If instead they had access to a coordinate system that\nwas more appropriate for spirals, such as polar coordinates, they could very easily separate the data.\nThis is the reason why Locally Linear Embeddings [14] could very easily discover the coordinate\ncharts for the underlying manifold, because k-nearest neighbors is an extremely flexible type of\nnonlinearity. Allowing the network to go into higher dimensions makes it easier to separate data.\n\n5 Lie Group actions on the metric fibre bundle\n\nThis section will abstractly formulate Section 4 as neural networks learning sequences of left Lie\nGroup actions on the metric (fibre) space over the data manifold to make the metric representation of\nthe underlying data manifold Euclidean. 
Several definitions, which can be found in the appendix in\nthe full version of this paper, are needed to formulate Lie group actions on principal and associated\nfibre bundles, namely those of bundles, fibre bundles, Lie Groups and their actions on manifolds [18].\nDefinition 5.1. (Principal fibre bundle) A bundle (E, \u03c0, M) is called a principal G-bundle if:\n(i.) E is equipped with a right G-action \u25c1: E \u00d7 G \u2192 E.\n(ii.) The right G-action \u25c1 is free.\n(iii.) (E, \u03c0, M) is (bundle) isomorphic to (E, \u03c1, E/G) where the surjective projection map \u03c1 : E \u2192\nE/G is defined by \u03c1(\u03b5) := [\u03b5] as the equivalence class of points of \u03b5.\nRemark. (Principal bundle) The principal fibre bundle can be thought of (locally) as a fibre bundle\nwith fibres G over the base manifold M.\nDefinition 5.2. (Associated fibre bundle) Given a G principal bundle and a smooth manifold F on\nwhich exists a left G-action \u25b7: G \u00d7 F \u2192 F, the associated fibre bundle (P_F, \u03c0_F, M) is defined as\nfollows:\n(i.) let \u223c_G be the relation on P \u00d7 F defined as follows:\n(p, f) \u223c_G (p\u2032, f\u2032) :\u21d4 \u2203h \u2208 G : p\u2032 = p \u25c1 h and f\u2032 = h^{\u22121} \u25b7 f, and thus P_F := (P \u00d7 F) / \u223c_G.\n(ii.) define \u03c0_F : P_F \u2192 M by \u03c0_F([(p, f)]) := \u03c0(p).\nNeural network actions on the manifold M are a (layerwise) sequence of left G-actions on the\nassociated (metric space) fibre bundle. Let the dimension of the manifold d := dim M.\nThe structure group G is taken to be the general linear group of dimension d over R, i.e.\nG = GL(d, R) := {\u03c6 : R^d \u2192 R^d | det \u03c6 \u2260 0}.\nThe principal bundle P is taken to be the frame bundle, i.e. 
P = LM := \u222a_{p\u2208M} L_pM :=\n\u222a_{p\u2208M}{(e_1, ..., e_d) \u2208 T_pM | (e_1, ..., e_d) is a basis of T_pM}, where T_pM is the tangent space of\nM at the point p \u2208 M.\nThe right G-action \u25c1: LM \u00d7 GL(d, R) \u2192 LM is defined by\n$e \\triangleleft h = (e_1, \\dots, e_d) \\triangleleft h := (h^{a_l\\,\\cdot}_{\\cdot\\,1} e_{a_l}, \\dots, h^{a_l\\,\\cdot}_{\\cdot\\,d} e_{a_l})$, which is the standard transformation law of linear algebra.\n\nThe fibre F in the associated bundle will be the metric tensor space, and so F = (R^d)\u2217 \u00d7 (R^d)\u2217,\nwhere the \u2217 denotes the cospace. With this, the left G-action \u25b7: GL(d, R) \u00d7 F \u2192 F is defined as\nthe inverse of the left, namely $(h^{-1} \\triangleright g)_{a_l b_l} := (g \\triangleleft h)_{a_l b_l} = g_{a_{l+1} b_{l+1}} h^{a_{l+1}\\,\\cdot}_{\\cdot\\,a_l} h^{b_{l+1}\\,\\cdot}_{\\cdot\\,b_l}$.\n\nLayerwise sequential application of the left G-action from output to input is thus simply understood:\n\n$$\\left(h_0^{-1} \\triangleright h_1^{-1} \\triangleright \\dots \\triangleright h_L^{-1} \\triangleright g\\right)_{a_0 b_0} = \\left(h_0^{-1} \\bullet \\dots \\bullet h_L^{-1}\\right) \\triangleright g_{a_L b_L} = \\prod_{l'=L-1}^{0} \\left( h^{a_{l'+1}\\,\\cdot}_{\\cdot\\,a_{l'}} h^{b_{l'+1}\\,\\cdot}_{\\cdot\\,b_{l'}} \\right) g_{a_L b_L} \\tag{14}$$\n\nThis is equivalent to Equation 10, only formulated in a formal, abstract sense.\n\n6 Backpropagation as a sequence of right Lie Group actions\n\nAn analysis similar to that performed in Sections 4 and 5 can be done to generalize error\nbackpropagation as a sequence of right Lie Group actions on the output error (or more generally\nto pull back the frame bundle). The discrete layerwise error backpropagation algorithm [15] is derived\nusing the chain rule on graphs. 
The closed form solution of the gradient of the output error\nE with respect to any layer weight W(l\u22121) can be solved for recursively from the output, by\nbackpropagating errors:\n\n$$\\frac{\\partial E}{\\partial W^{(l-1)}} = \\left(\\frac{\\partial E}{\\partial x^{(L)}}\\right)_{a_L} \\prod_{l'=L-1}^{l} \\left(\\frac{\\partial x^{(l'+1)}}{\\partial x^{(l')}}\\right)^{a_{l'+1}\\,\\cdot}_{\\cdot\\,a_{l'}} \\left(\\frac{\\partial x^{(l)}}{\\partial W^{(l-1)}}\\right)^{a_l} \\tag{15}$$\n\nIn practice, one further applies the chain rule $\\left(\\frac{\\partial x^{(l)}}{\\partial W^{(l-1)}}\\right)^{a_l} = \\left(\\frac{\\partial x^{(l)}}{\\partial z^{(l)}}\\right)^{a_l\\,\\cdot}_{\\cdot\\,b_l} \\left(\\frac{\\partial z^{(l)}}{\\partial W^{(l-1)}}\\right)^{b_l}$. Note\nthat W(l\u22121) is a coordinate chart on the parameter manifold [1], not the data manifold.\nIn this form it is immediately seen that error backpropagation is a sequence of right G-actions\n$\\prod_{l'=L-1}^{l} \\left(\\frac{\\partial x^{(l'+1)}}{\\partial x^{(l')}}\\right)^{a_{l'+1}\\,\\cdot}_{\\cdot\\,a_{l'}}$ on the output frame bundle $\\left(\\frac{\\partial}{\\partial x^{(L)}}\\right)_{a_L}$. This pulls back the frame bundle\nacting on E to the coordinate system at layer l, and thus puts it in the same space as $\\left(\\frac{\\partial x^{(l)}}{\\partial W^{(l-1)}}\\right)^{a_l}$.\nFor the residual network, the transformation matrix Equation 11 can be inserted into Equation 15. By\nthe same logic as before, the infinite tensor product in Equation 15 converges in the limit L \u2192 \u221e in\nthe same way as in Equation 12, and so it is not rewritten here. In the limit this becomes a smooth\nright G-action on the frame bundle, which itself is acting on the error cost function.\n\n7 Numerical experiments\n\nThis section presents the results of numerical experiments used to understand the proposed theory.\nThe C\u221e hyperbolic tangent has been used for all experiments, with weights initialized according\nto [5]. 
For all of the experiments, layer 0 is the input Cartesian coordinate representation of the\ndata manifold, and the final layer L is the last hidden layer before the linear softmax classifier. GPU\nimplementations of the neural networks are written in the Python library Theano [2, 16].\n\n7.1 Neural networks with Ck differentiable coordinate transformations\nAs described in Section 3, kth order smoothness can be imposed on the network by considering\nnetwork structures defined by e.g. Equations 3-4. As seen in Figure 2a, the standard C0 network\nwith no impositions on differentiability has very sharp layerwise transformations and separates\nthe data in an unintuitive way. The C1 residual network and C2 network can be seen in Figures 2b\nand 2c, and exhibit smooth layerwise transformations and separate the data in a more intuitive way.\nForward differencing was used for the C1 network, while central differencing was used for the C2\nnetwork, except at the output layer where backward differencing was used, and at the input first\norder smoothness was used as forward differencing violates causality.\nIn Figure 2c one can see that for the C2 network the red and blue data sets pass over each other\nin layers 4, 5 and 6. This can be understood because the C2 network has the same form as a residual\nnetwork, with an additional momentum term pushing the data past each other.\n\n(a) A batch size of 300 for untangling data. As early as layer 4 the input connected sets have been disconnected\nand the data are untangled in an unintuitive way. This means a more complex coordinate representation of\nthe data manifold was learned.\n\n(b) A batch size of 1000 for untangling data. Because the large batch size can well-sample the data manifold, the\nspiral sets stay connected and are untangled in an intuitive way. 
This means a simple coordinate representation\nof the data manifold was learned.\nFigure 4: The effect of batch size on coordinate representation learned by the same 2-dimensional\nC1 network, where layer 0 is the input representation, and both examples achieve 0% error. A basic\ntheorem in topology says continuous functions map connected sets to connected sets. A small batch\nsize of 300 during training sparsely samples from the connected manifold and the network learns\noverfitted coordinate representations. With a larger batch size of 1000 during training the network\nlearns a simpler coordinate representation and keeps the connected input connected throughout.\n\n7.2 Coordinate representations of the data manifold and metric tensor\n\nAs described in Sections 4 and 5, the network is learning a sequence of non-linear coordinate\ntransformations, beginning with Cartesian coordinates, to find a coordinate representation of the\ndata manifold that well represents the data, and this representation tends to be flat. This process\ncan be visualized in Figure 3. This experiment used a C1 (residual) network and so the group actions\non the principal and associated bundles act approximately smoothly along the fibres of the bundles.\nIn the forward direction, beginning with Cartesian coordinates, a sequence of C1 differentiable\ncoordinate transformations is applied to find a nonlinear coordinate representation of the data manifold\nsuch that in the output coordinates the classes satisfy the cost restraint. In the reverse direction,\nstarting with a standard Euclidean metric at the output, Equation 8, the coordinate representation\nof the metric tensor is backpropagated through the network to the input by Equations 9-10 to find\nthe metric tensor representation in the input Cartesian coordinates. 
The principal components of\nthe metric tensor are used to draw the ellipses in Figure 3.\n\n7.3 Effect of batch size on set connectedness and topology\n\nA basic theorem in topology says that continuous functions map connected sets to connected sets.\nHowever, in Figure 4a it is seen that as early as layer 4 the continuous neural network appears to break\nthe connected input set into disconnected sets. Additionally, although it achieves 0% error,\nit is learning very complicated and unintuitive coordinate transformations to represent the data\nin a linearly separable form. This is because during training with a small batch size of 300 in\nthe stochastic gradient descent search, the underlying manifold was not sufficiently sampled to\nrepresent the entire connected manifold, and so it seemed disconnected.\nCompare this to Figure 4b, in which a larger batch size of 1000 was used, so that the manifold was sufficiently\nsampled to represent the entire connected manifold; the network was again able to achieve 0%\nerror. The coordinate transformations learned by the neural network with the larger batch size\nseem to untangle the data more intuitively and in a simpler way than those of Figure 4a. 
Note that this\nexperiment is in two dimensions, and with higher dimensional data the issue of batch size and set\nconnectedness becomes exponentially more important by the curse of dimensionality.\n\n[Figure 4 panel labels: layer 0 through layer 50 in steps of 2, for both (a) and (b)]\n\n(a) A 10 layer C1 network struggles to separate the spirals and has a 1% error rate.\n\n(b) A 20 layer C1 network is able to separate the spirals and has a 0% error rate.\n\n(c) A 40 layer C1 network is able to separate the spirals and has a 0% error rate.\n\nFigure 5: The effect of the number of layers on the separation process of a C1 neural network. In\nFigure 5a it is seen that the \u2206l is too large to properly separate the data. In Figures 5b and 5c the\n\u2206l is sufficiently small to separate the data. Interestingly, the separation process is not as simple as\nmerely doubling the parameterization and halving the partitioning in Equation 7, because this is\na nonlinear system of ODEs. This is seen in Figures 5b and 5c; the data are at different levels of\nseparation at the same position of layer parameterization, for example when comparing layer 18 in\nFigure 5b to layer 36 in Figure 5c.\n\n7.4 Effect of number of layers on the separation process\nThis experiment compares the process in which 2-dimensional C1 networks with 10, 20 and 40\nlayers separate the same data, thus experimenting on the \u2206l in the partitioning of Equation 7, as\nseen in Figure 5. The 10 layer network is unable to properly separate the data and achieves a 1%\nerror rate, whereas the 20 and 40 layer networks both achieve 0% error rates. 
In Figures 5b and 5c\nit is seen that at the same positions of layer parameterization, for example layers 18 and 36 respectively,\nthe data are at different levels of separation. This implies that the partitioning cannot be interpreted\nas simply halving the \u2206l when doubling the number of layers. This is because the system of\nODEs is nonlinear and the \u2206l is implicit in the weight matrix.\n\n8 Conclusions\n\nThis paper forms part of an attempt to construct a formalized general theory of neural networks as a\nbranch of Riemannian geometry. In the forward direction, and starting in Cartesian coordinates, the\nnetwork is learning a sequence of coordinate transformations to find a coordinate representation of\nthe data manifold that encodes the data well, and experimental results suggest this imposes a flatness\nconstraint on the metric tensor in this learned coordinate system. One can then backpropagate the\ncoordinate representation of the metric tensor to find its form in Cartesian coordinates. This can be\nused to define an \u03b5-\u03b4 relationship between the input and output data. Coordinate backpropagation\nwas formulated in a formal, abstract sense in terms of Lie group actions on the metric fibre bundle.\nThe error backpropagation algorithm was then formulated in terms of Lie group actions on the\nframe bundle. For a residual network in the limit, the Lie group acts smoothly along the fibres of the\nbundles. Experiments were conducted to confirm and better understand aspects of this formulation.\n\n9 Acknowledgements\n\nThis work has been supported in part by the U.S. Air Force Office of Scientific Research (AFOSR)\nunder Grant No. FA9550-15-1-0400. The first author has been supported by the PSU/ARL Walker\nFellowship. 
Any opinions, findings and conclusions or recommendations expressed in this publication\nare those of the authors and do not necessarily reflect the views of the sponsoring agencies.\n\n[Figure 5 panel labels: (a) layer 0 through layer 10; (b) layer 0 through layer 20 in steps of 2; (c) layer 0 through layer 40 in steps of 4]\n\nReferences\n[1] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American\nMathematical Soc., 2007.\n\n[2] Fr\u00e9d\u00e9ric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud\nBergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features\nand speed improvements. arXiv preprint arXiv:1211.5590, 2012.\n\n[3] Yoshua Bengio, Gr\u00e9goire Mesnil, Yann Dauphin, and Salah Rifai. Better mixing via deep\nrepresentations. In Proceedings of the 30th International Conference on Machine Learning\n(ICML-13), pages 552\u2013560, 2013.\n\n[4] Alexandre Boulch. Sharesnet: reducing residual network parameter number by sharing\nweights. arXiv preprint arXiv:1702.08782, 2017.\n\n[5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward\nneural networks. In AISTATS, volume 9, pages 249\u2013256, 2010.\n\n[6] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint\narXiv:1410.5401, 2014.\n\n[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 770\u2013778, 2016.\n\n[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual\nnetworks. In European Conference on Computer Vision, pages 630\u2013645. Springer, 2016.\n\n[9] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. 
Neural computation,\n9(8):1735\u20131780, 1997.\n\n[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep\nconvolutional neural networks. In Advances in Neural Information Processing Systems, pages\n1097\u20131105, 2012.\n\n[11] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436\u2013444,\n2015.\n\n[12] John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1\u201329. Springer,\n2003.\n\n[13] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space\nword representations. In HLT-NAACL, volume 13, pages 746\u2013751, 2013.\n\n[14] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear\nembedding. Science, 290(5500):2323\u20132326, 2000.\n\n[15] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal\nrepresentations by error propagation. Technical report, DTIC Document, 1985.\n\n[16] Theano Development Team. Theano: A Python framework for fast computation of\nmathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.\n\n[17] Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles\nof relatively shallow networks. In Advances in Neural Information Processing Systems, pages\n550\u2013558, 2016.\n\n[18] Gerard Walschap. Metric structures in differential geometry, volume 224. Springer Science &\nBusiness Media, 2012.\n\n[19] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional\nnetworks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages\n2528\u20132535. IEEE, 2010.\n", "award": [], "sourceid": 1594, "authors": [{"given_name": "Michael", "family_name": "Hauser", "institution": "Pennsylvania State University"}, {"given_name": "Asok", "family_name": "Ray", "institution": "Pennsylvania State University"}]}