{"title": "On Invariance in Hierarchical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 162, "page_last": 170, "abstract": "A goal of central importance in the study of hierarchical models for object recognition -- and indeed the visual cortex -- is that of understanding quantitatively the trade-off between invariance and selectivity, and how invariance and discrimination properties contribute towards providing an improved representation useful for learning from data. In this work we provide a general group-theoretic framework for characterizing and understanding invariance in a family of hierarchical models. We show that by taking an algebraic perspective, one can provide a concise set of conditions which must be met to establish invariance, as well as a constructive prescription for meeting those conditions. Analyses in specific cases of particular relevance to computer vision and text processing are given, yielding insight into how and when invariance can be achieved. We find that the minimal sets of transformations intrinsic to the hierarchical model needed to support a particular invariance can be clearly described, thereby encouraging efficient computational implementations.", "full_text": "On Invariance in Hierarchical Models\n\nJake Bouvrie, Lorenzo Rosasco, and Tomaso Poggio\n\nCenter for Biological and Computational Learning\n\nMassachusetts Institute of Technology\n\n{jvb,lrosasco}@mit.edu, tp@ai.mit.edu\n\nCambridge, MA USA\n\nAbstract\n\nA goal of central importance in the study of hierarchical models for object recognition\n\u2013 and indeed the mammalian visual cortex \u2013 is that of understanding quantitatively\nthe trade-off between invariance and selectivity, and how invariance and discrimination\nproperties contribute towards providing an improved representation\nuseful for learning from data. 
In this work we provide a general group-theoretic\nframework for characterizing and understanding invariance in a family of hierarchical\nmodels. We show that by taking an algebraic perspective, one can provide\na concise set of conditions which must be met to establish invariance, as well\nas a constructive prescription for meeting those conditions. Analyses in specific\ncases of particular relevance to computer vision and text processing are given,\nyielding insight into how and when invariance can be achieved. We find that the\nminimal intrinsic properties of a hierarchical model needed to support a particular\ninvariance can be clearly described, thereby encouraging efficient computational\nimplementations.\n\n1 Introduction\nSeveral models of object recognition drawing inspiration from visual cortex have been developed\nover the past few decades [3, 8, 6, 12, 10, 9, 7], and have enjoyed substantial empirical success. A\ncentral theme found in this family of models is the use of Hubel and Wiesel\u2019s simple and complex\ncell ideas [5]. In the primary visual cortex, simple units compute features by looking for the occurrence\nof a preferred stimulus in a region of the input (\u201creceptive field\u201d). Translation invariance is\nthen explicitly built into the processing pathway by way of complex units which pool locally over\nsimple units. The alternating simple-complex filtering/pooling process is repeated, building increasingly\ninvariant representations which are simultaneously selective for increasingly complex stimuli.\nIn a computer implementation, the final representation can then be presented to a supervised learning\nalgorithm.\nFollowing the flow of processing in a hierarchy from the bottom upwards, the layerwise representations\ngain invariance while simultaneously becoming selective for more complex patterns. 
A goal of\ncentral importance in the study of such hierarchical architectures and the visual cortex alike is that of\nunderstanding quantitatively this invariance-selectivity tradeoff, and how invariance and selectivity\ncontribute towards providing an improved representation useful for learning from examples. In this\npaper, we focus on hierarchical models incorporating an explicit attempt to impose transformation\ninvariance, and do not directly address the case of deep layered models without local transformation\nor pooling operations (e.g. [4]).\nIn a recent effort, Smale et al. [11] have established a framework which makes possible a more precise\ncharacterization of the operation of hierarchical models via the study of invariance and discrimination\nproperties. However, Smale et al. study invariance in an implicit, rather than constructive,\nfashion. In their work, two cases are studied: invariance with respect to image rotations and string\nreversals, and the analysis is tailored to the particular setting. In this paper, we reinterpret and extend\nthe invariance analysis of Smale et al. using a group-theoretic language towards clarifying and\nunifying the general properties necessary for invariance in a family of hierarchical models. We show\nthat by systematically applying algebraic tools, one can provide a concise set of conditions which\nmust be met to establish invariance, as well as a constructive prescription for meeting those conditions.\nWe additionally find that when one imposes the mild requirement that the transformations of\ninterest have group structure, a broad class of hierarchical models can only be invariant to orthogonal\ntransformations. This result suggests that common architectures found in the literature might\nneed to be rethought and modified so as to allow for broader invariance possibilities. 
Finally, we\nshow that our framework automatically points the way to efficient computational implementations\nof invariant models.\nThe paper is organized as follows. We first recall important definitions from Smale et al. Next, we\nextend the machinery of Smale et al. to a more general setting allowing for general pooling functions,\nand give a proof for invariance of the corresponding family of hierarchical feature maps. This\ncontribution is key because it shows that several results in [11] do not depend on the particular choice\nof pooling function. We then establish a group-theoretic framework for characterizing invariance in\nhierarchical models expressed in terms of the objects defined here. Within this framework, we turn\nto the problem of invariance in two specific domains of practical relevance: images and text strings.\nFinally, we conclude with a few remarks summarizing the contributions and relevance of our work.\nAll proofs are omitted here, but can be found in the online supplementary material [2]. The reader\nis assumed to be familiar with introductory concepts in group theory. An excellent reference is [1].\n2 Invariance of a Hierarchical Feature Map\nWe first review important definitions and concepts concerning the neural response feature map presented\nin Smale et al. The reader is encouraged to consult [11] for a more detailed discussion. We\nwill draw attention to the conditions needed for the neural response to be invariant with respect\nto a family of arbitrary transformations, and then generalize the neural response map to allow for\narbitrary pooling functions. The proof of invariance given in [11] is extended to this generalized\nsetting. The proof presented here (and in [11]) hinges on a technical \u201cAssumption\u201d which must be\nverified to hold true, given the model and the transformations to which we would like to be invariant.\nTherefore the key step to establishing invariance is verification of this Assumption. 
After stating the\nAssumption and how it figures into the overall picture, we explore its verification in Section 3. There\nwe are able to describe, for a broad range of hierarchical models (including a class of convolutional\nneural networks [6]), the necessary conditions for invariance to a set of transformations.\n2.1 Definition of the Feature Map and Invariance\nFirst consider a system of patches of increasing size associated to successive layers of the hierarchy,\nv1 \u2282 v2 \u2282 \u00b7\u00b7\u00b7 \u2282 vn \u2286 S, with vn taken to be the size of the full input. Here layer n is the\ntop-most layer, and the patches are pieces of the domain on which the input data are defined. The\nset S could contain, for example, points in R2 (in the case of 2D graphics) or integer indices (the\ncase of strings). Until Section 4, the data are seen as general functions; however, it is intuitively\nhelpful to think of the special case of images, and we will use a notation that is suggestive of this\nparticular case. Next, we\u2019ll need spaces of functions on the patches, Im(vi). In many cases it will\nonly be necessary to work with arbitrary successive pairs of patches (layers), in which case we will\ndenote by u the smaller patch, and v the next larger patch. We next introduce the transformation\nsets Hi, i = 1, . . . , n intrinsic to the model. These are abstract sets in general; however, here we\nwill take them to consist of translations with h \u2208 Hi defined by h : vi \u2192 vi+1. Note that by\nconstruction, the functions h \u2208 Hi implicitly involve restriction. For example, if f \u2208 Im(v2) is an\nimage of size v2 and h \u2208 H1, then f \u25e6 h is a piece of the image of size v1. The particular piece is\ndetermined by h. 
Finally, to each layer we also associate a dictionary of templates, Qi \u2286 Im(vi).\nThe templates could be randomly sampled from Im(vi), for example.\n\nGiven the ingredients above, the neural response Nm(f) and associated derived kernel \u02c6Km are\ndefined as follows.\nDefinition 1 (Neural Response). Given a non-negative valued, normalized, initial reproducing kernel\n\u02c6K1, the m-th derived kernel \u02c6Km, for m = 2, . . . , n, is obtained by normalizing Km(f, g) =\n\u27e8Nm(f), Nm(g)\u27e9L2(Qm\u22121) where\n\nNm(f)(q) = maxh\u2208H \u02c6Km\u22121(f \u25e6 h, q),   q \u2208 Qm\u22121\n\nwith H = Hm\u22121.\nHere a kernel is normalized by taking \u02c6K(f, g) = K(f, g)/\u221a(K(f, f)K(g, g)). Note that the neural\nresponse decomposes the input into a hierarchy of parts, analyzing sub-regions at different scales.\nThe neural response and derived kernels describe in compact, abstract terms the core operations built\ninto the many related hierarchical models of object recognition cited above.\nWe next define a set of transformations, distinct from the Hi above, to which we would like to be\ninvariant. Let r \u2208 Ri, i \u2208 {1, . . . , n \u2212 1}, be transformations that can be viewed as mapping\neither vi to itself or vi+1 to itself (depending on the context in which it is applied). We rule out the\ndegenerate translations and transformations, h or r mapping their entire domain to a single point.\nWhen it is necessary to identify transformations defined on a specific domain v, we will use the\nnotation rv : v \u2192 v. Invariance of the neural response feature map can now be defined.\n\nDefinition 2 (Invariance). 
The feature map Nm is invariant to the domain transformation r \u2208 R if\n\nNm(f) = Nm(f \u25e6 r), for all f \u2208 Im(vm), or equivalently, \u02c6Km(f \u25e6 r, f) = 1, for all f \u2208 Im(vm).\n\nIn order to state the invariance properties of a given feature map, a technical assumption is needed.\nAssumption 1 (from [11]). Fix any r \u2208 R. There exists a surjective map \u03c0 : H \u2192 H satisfying\n\nrv \u25e6 h = \u03c0(h) \u25e6 ru\n\n(1)\nfor all h \u2208 H.\nThis technical assumption is best described by way of an example. Consider images and rotations:\nthe assumption stipulates that rotating an image and then taking a restriction must be equivalent to\nfirst taking a (different) restriction and then rotating the resulting image patch. As we will describe\nbelow, establishing invariance will boil down to verifying Assumption 1.\n2.2 Invariance and Generalized Pooling\nWe next provide a generalized proof of invariance of a family of hierarchical feature maps, where\nthe properties we derive do not depend on the choice of the pooling function. Given the above\nassumption, invariance can be established for general pooling functions of which the max is only\none particular choice. We will first define such general pooling functions, and then describe the\ncorresponding generalized feature maps. The final step will then be to state an invariance result for\nthe generalized feature map, given that Assumption 1 holds.\nLet H = Hi, with i \u2208 {1, . . . , n \u2212 1}, and let B(R) denote the Borel algebra of R. As in Assumption 1,\nwe define \u03c0 : H \u2192 H to be a surjection, and let \u03a8 : B(R++) \u2192 R++ be a bounded pooling\nfunction defined for Borel sets B \u2208 B(R) consisting of only positive elements. Here R++ denotes\nthe set of strictly positive reals. 
Given a positive functional F acting on elements of H, we define\nthe set F (H) \u2208 B(R) as\n\nF (H) = {F [h] | h \u2208 H}.\n\nNote that since \u03c0 is surjective, \u03c0(H) = H, and therefore (F \u25e6 \u03c0)(H) = F (H).\nWith these definitions in hand, we can define a more general neural response as follows. For H =\nHm\u22121 and all q \u2208 Q = Qm\u22121, let the neural response be given by\n\nNm(f)(q) = (\u03a8 \u25e6 F )(H)\n\nwhere\n\nF [h] = \u02c6Km\u22121(f \u25e6 h, q).\n\nGiven Assumption 1, we can now prove invariance of a neural response feature map built from the\ngeneral pooling function \u03a8.\n\nTheorem 1. Given any function \u03a8 : B(R++) \u2192 R++, if the initial kernel satisfies \u02c6K1(f, f \u25e6 r) = 1\nfor all r \u2208 R, f \u2208 Im(v1), then\n\nNm(f) = Nm(f \u25e6 r),\n\nfor all r \u2208 R, f \u2208 Im(vm) and m \u2264 n.\nWe give a few practical examples of the pooling function \u03a8.\nMaximum: The original neural response is recovered by setting \u03a8(B) = sup B.\nAveraging: We can consider average pooling by setting \u03a8(B) = \u222bx\u2208B x d\u00b5. If H has a measure\n\u03c1H, then a natural choice for \u00b5 is the induced push-forward measure \u03c1H \u25e6 F \u22121. The measure \u03c1H\nmay be simply uniform, or in the case of a finite set H, discrete. Similarly, we may consider more\ngeneral weighted averages.\n3 A Group-Theoretic Invariance Framework\nThis section establishes general definitions and conditions needed to formalize a group-theoretic\nconcept of invariance. When Assumption 1 holds, the neural response map can be made invariant\nto the given set of transformations. Proving invariance thus reduces to verifying that the\nAssumption actually holds. 
A primary goal of this paper is to place this task within\nan algebraic framework so that the question of verifying the Assumption can be formalized and\nexplored in full generality with respect to model architecture, and the possible transformations. Formalization\nof Assumption 1 culminates in Definition 3 below, where purely algebraic conditions\nare separated from conditions stemming from the mechanics of the hierarchy. This separation results\nin a simplified problem because one can then tackle the algebraic questions independent of and\nuntangled from the model architecture.\nOur general approach is as follows. We will require that R is a subset of a group and then use\nalgebraic tools to understand when and how Assumption 1 can be satisfied given different instances\nof R. If R is fixed, then the assumption can only be satisfied by placing requirements on the sets of\nbuilt-in translations Hi, i = 1, . . . , n. Therefore, we will make quantitative, constructive statements\nabout the minimal sets of translations associated to a layer required to support invariance to a set of\ntransformations. Conversely, one can fix Hi and then ask whether the resulting feature map will be\ninvariant to any transformations. We explore this perspective as well, particularly in the examples\nof Section 4, where specific problem domains are considered.\n3.1 Formulating Conditions for Invariance\nRecall that vi \u2282 S. Because it will be necessary to translate in S, it is assumed that an appropriate\nnotion of addition between the elements of S is given. If G is a group, we denote the (left) action of\nG on S by A : G\u00d7S \u2192 S. Given an element g \u2208 G, the notation Ag : S \u2192 S will be utilized. Since\nA is a group action, it satisfies (Ag \u25e6 Ag\u2032)(x) = Agg\u2032(x) for all x \u2208 S and all g, g\u2032 \u2208 G. 
Consider\nan arbitrary pair of successive layers with associated patch sizes u and v, with u \u2282 v \u2282 S. Recall\nthat the definition of the neural response involves the \u201cbuilt-in\u201d translation functions h : u \u2192 v,\nfor h \u2208 H = Hu. Since S has an addition operation, we may parameterize h \u2208 H explicitly as\nha(x) = x + a for x \u2208 u and parameter a \u2208 v such that (u + a) \u2282 v. The restriction behavior\nof the translations in H prevents us from simply generating a group out of the elements of H. To\nget around this difficulty, we will decompose the h \u2208 H into a composition of two functions: a\ntranslation group action and an inclusion.\nLet S generate a group of translations T by defining the injective map\n\nS \u2192 T\na \u21a6 ta.\n\n(2)\nThat is, to every element a \u2208 S we associate a member of the group T whose action corresponds\nto translation in S by a: Ata(x) = x + a for x, a \u2208 S. (Although we assume the specific case of\ntranslations throughout, the sets of intrinsic operations Hi may more generally contain other kinds\nof transformations. We assume, however, that T is abelian.) Furthermore, because the translations\nH can be parameterized by an element of S, one can apply Equation (2) to define an injective map\n\u03c4 : H \u2192 T by ha \u21a6 ta. Finally, we define \u03b9u : u \u21aa S to be the canonical inclusion of u into S.\nWe can now rewrite ha : u \u2192 v as\n\nha = Ata \u25e6 \u03b9u\n\nNote that because a satisfies (u + a) \u2282 v by definition, im(Ata \u25e6 \u03b9u) \u2282 v automatically.\nIn the statement of Assumption 1, the transformations r \u2208 R can be seen as maps from u to itself,\nor from v to itself, depending on which side of Equation (1) they are applied. To avoid confusion\nwe denoted the former case by ru and the latter by rv. 
Although ru and rv are the same \u201ckind\u201d\nof transformation, one cannot in general associate to each \u201ckind\u201d of transformation r \u2208 R a single\nelement of some group as we did in the case of translations above. The group action could very\nwell be different depending on the context. We will therefore consider ru and rv to be distinct\ntransformations, loosely associated to r. In our development, we will make the important assumption\nthat the transformations ru, rv \u2208 R can be expressed as actions of elements of some group, and\ndenote this group by R. More precisely, for every ru \u2208 R, there is assumed to be a corresponding\nelement \u03c1u \u2208 R whose action satisfies A\u03c1u(x) = ru(x) for all x \u2208 u, and similarly, for every rv \u2208\nR, there is assumed to be a corresponding element \u03c1v \u2208 R whose action satisfies A\u03c1v(x) = rv(x)\nfor all x \u2208 v. The distinction between \u03c1u and \u03c1v will become clear in the case of feature maps\ndefined on functions whose domain is a finite set (such as strings). In the case of images, we will\nsee that \u03c1u = \u03c1v.\nAssumption 1 requires that rv \u25e6 h = h\u2032 \u25e6 ru for h, h\u2032 \u2208 H, with the map \u03c0 : h \u21a6 h\u2032 onto. We\nnow restate this condition in group-theoretic terms. Define \u02dcT = \u03c4(Hu) \u2286 T to be the set of group\nelements corresponding to Hu. Set h = ha, h\u2032 = hb, and denote also by ru, rv the elements of\nthe group R corresponding to the given transformation r \u2208 R. The Assumption says in part that\nrv \u25e6 h = h\u2032 \u25e6 ru for some h\u2032 \u2208 H. This can now be expressed as\n\nArv \u25e6 Ata \u25e6 \u03b9u = Atb \u25e6 \u03b9u \u25e6 Aru \u25e6 \u03b9u\n\n(3)\nfor some tb \u2208 \u02dcT . In order to arrive at a purely algebraic condition for invariance, we will need\nto understand and manipulate compositions of group actions. 
However, on the right-hand side of\nEquation (3) the translation Atb is separated from the transformation Aru by the inclusion \u03b9u. We\nwill therefore need to introduce an additional constraint on R. This constraint leads to our first\ncondition for invariance: If x \u2208 u, then we require that Aru(x) \u2208 u for all r \u2208 R. One can now see\nthat if this condition is met, then verifying Equation (3) reduces to checking that\n\nArv \u25e6 Ata = Atb \u25e6 Aru ,\n\n(4)\nand that the map ta \u21a6 tb is onto.\nThe next step is to turn compositions of actions Ax \u25e6 Ay into an equivalent action of the form Axy.\nTo do this, one needs R and T to be subgroups of the same group G so that the associativity property\nof group actions applies. A general way to accomplish this is to form the semidirect product\n\nG = T \u22ca R.\n\n(5)\nRecall that the semidirect product G = X \u22ca Y is a way to put two subgroups X, Y together where\nX is required to be normal in G, and X \u2229 Y = {1} (the usual direct product requires both subgroups\nto be normal). In our setting G is easily shown to be isomorphic to a group with normal subgroup T\nand subgroup R where each element may be written in the form g = tr for t \u2208 T, r \u2208 R. We will\nsee below that we do not lose generality by requiring T to be normal. Note that although this construction\nprecludes R from containing the transformations in T , allowing R to contain translations\nis an uninteresting case.\nConsider now the action Ag for g \u2208 G = T \u22ca R. Returning to Equation (4), we can apply the\nassociativity property of actions and see that Equation (4) will hold as long as\n\nrv \u02dcT = \u02dcT ru\n\n(6)\nfor every r \u2208 R. 
This is our second condition for invariance, and is a purely algebraic requirement\nconcerning the groups R and T , distinct from the restriction related conditions involving the patches\nu and v.\nThe two invariance conditions we have described thus far combine to capture the content of Assumption 1,\nbut in a manner that separates group related conditions from constraints due to restriction and\nthe nested nature of an architecture\u2019s patch domains. We can summarize the invariance conditions\nin the form of a concise definition that can be applied to establish invariance of the neural response\nfeature maps Nm(f), 2 \u2264 m \u2264 n with respect to a set of transformations. Let \u02dcR \u2286 R be the set of\ntransformations for which we would like to prove invariance, in correspondence with R.\nDefinition 3 (Compatible Sets). The subsets \u02dcR \u2282 R and \u02dcT \u2282 T are compatible if all of the following\nconditions hold:\n\n1. For each r \u2208 \u02dcR, rv \u02dcT = \u02dcT ru. When ru = rv for all r \u2208 R, this means that the normalizer of\n\u02dcT in \u02dcR is \u02dcR.\n\n2. Left transformations rv never take a point in v outside of v, and right transformations ru\nnever take a point in u/v outside of u/v (respectively):\n\nim Aru \u25e6 \u03b9u \u2286 u,   im Arv \u25e6 \u03b9v \u2286 v,   im Aru \u25e6 \u03b9v \u2286 v,\n\nfor all r \u2208 \u02dcR.\n\n3. 
Translations never take a point in u outside of v:\n\nim At \u25e6 \u03b9u \u2286 v\n\nfor all t \u2208 \u02dcT .\n\nThe final condition above has been added to ensure that any set of translations \u02dcT we might construct\nsatisfies the implicit assumption that the hierarchy\u2019s translation functions h \u2208 H are maps which\nrespect the definition h : u \u2192 v.\nIf \u02dcR and \u02dcT are compatible, then for each ta \u2208 \u02dcT Equation (3) holds for some tb \u2208 \u02dcT , and the map\nta \u21a6 tb is a surjection \u02dcT \u2192 \u02dcT (by Condition (1) above). So Assumption 1 holds.\nAs will become clear in the following section, the tools available to us from group theory will\nprovide insight into the structure of compatible sets.\n3.2 Orbits and Compatible Sets\nSuppose we assume that \u02dcR is a subgroup (rather than just a subset), and ask for the smallest compatible\n\u02dcT . We will show that the only way to satisfy Condition (1) in Definition 3 is to require that\n\u02dcT be a union of \u02dcR-orbits, under the action\n\n(t, r) \u21a6 rv t ru\u22121\n\n(7)\nfor t \u2208 T , r \u2208 \u02dcR. This perspective is particularly illuminating because it will eventually allow us\nto view conjugation by a transformation r as a permutation of \u02dcT , thereby establishing surjectivity of\nthe map \u03c0 defined in Assumption 1. For computational reasons, viewing \u02dcT as a union of orbits is\nalso convenient.\nIf rv = ru = r, then the action (7) is exactly conjugation and the \u02dcR-orbit of a translation t \u2208 T\nis the conjugacy class C \u02dcR(t) = {rtr\u22121 | r \u2208 \u02dcR}. 
Orbits of this form are also equivalence classes\nunder the relation s \u223c s\u2032 if s\u2032 \u2208 C \u02dcR(s), and we will require \u02dcT to be partitioned by the conjugacy\nclasses induced by \u02dcR.\nThe following Proposition shows that, given a set of candidate translations in H, we can construct a\nset of translations compatible with \u02dcR by requiring \u02dcT to be a union of \u02dcR-orbits under the action of\nconjugation.\nProposition 1. Let \u0393 \u2286 T be a given set of translations, and assume the following: (1) G \u2245 T \u22ca R,\n(2) For each r \u2208 R, r = ru = rv, (3) \u02dcR is a subgroup of R. Then Condition (1) of Definition 3 is\nsatisfied if and only if \u02dcT can be expressed as a union of orbits of the form\n\n\u02dcT = \u22c3t\u2208\u0393 C \u02dcR(t) .\n\n(8)\nAn interpretation of the above Proposition is that when \u02dcT is a union of \u02dcR-orbits, conjugation by\nr can be seen as a permutation of \u02dcT . In general, a given \u02dcT may be decomposed into several such\norbits and the conjugation action of \u02dcR on \u02dcT may not necessarily be transitive.\n4 Analysis of Specific Invariances\nWe continue with specific examples relevant to image processing and text analysis.\n4.1 Isometries of the Plane\nConsider the case where G is the group M of planar isometries, u \u2282 v \u2282 S = R2, and H involves\ntranslations in the plane. Let O2 be the group of orthogonal operators, and let ta \u2208 T denote a\ntranslation represented by the vector a \u2208 R2. 
In this section we assume the standard basis and work\nwith matrix representations of G when it is convenient.\nWe first need that T \u22b2 M, a property that will be useful when verifying Condition (1) of Definition 3.\nIndeed, from the First Isomorphism Theorem [1], the quotient space M/T is isomorphic to O2,\ngiving the following commutative diagram:\n\n[Commutative diagram: \u03c0 : M \u2192 O2, \u03c6 : M \u2192 M/T , \u02dc\u03c0 : M/T \u2192 O2]\n\nwhere the isomorphism \u02dc\u03c0 : M/T \u2192 O2 is given by \u02dc\u03c0(mT ) = \u03c0(m) and \u03c6(m) = mT . We recall\nthat the kernel of a group homomorphism \u03c0 : G \u2192 G\u2032 is a normal subgroup of G, and that normal\nsubgroups N of G are invariant under the operation of conjugation by elements g of G. That is,\ngN g\u22121 = N for all g \u2208 G. With this picture in mind, the following Lemma establishes that T \u22b2 M,\nand further shows that M is isomorphic to T \u22ca R with R = O2, and T a normal subgroup of M.\nLemma 1. For each m \u2208 M, ta \u2208 T , mta = tbm for some unique element tb \u2208 T .\nWe are now in a position to verify the Conditions of Definition 3 for the case of planar isometries.\nProposition 2. Let H be the set of translations associated to an arbitrary layer of the hierarchical\nfeature map and define the injective map \u03c4 : H \u2192 T by ha \u21a6 ta, where a is a parameter characterizing\nthe translation. Set \u0393 = {\u03c4(h) | h \u2208 H}. Take G = M \u2245 T \u22ca O2 as above. The\nsets\n\n\u02dcR = O2,   \u02dcT = \u22c3t\u2208\u0393 C \u02dcR(t)\n\nare compatible.\nThis proposition states that the hierarchical feature map may be made invariant to isometries; however,\none might reasonably ask whether the feature map can be invariant to other transformations.\nThe following Proposition confirms that isometries are the only possible transformations, with group\nstructure, to which the hierarchy may be made invariant in the exact sense of Definition 2.\nProposition 3. 
Assume that the input spaces {Im(vi)}, i = 1, . . . , n \u2212 1, are endowed with a norm inherited from\nIm(vn) by restriction. Then at all layers, the group of orthogonal operators O2 is the only group of\ntransformations to which the neural response can be invariant.\n\nFigure 1: Example illustrating construction of an appropriate\nH. Suppose H initially contains the translations\n\u0393 = {ha, hb, hc}. Then to be invariant to rotations, the\ncondition on H is that H must also include translations\ndefined by the \u02dcR-orbits O \u02dcR(ta), O \u02dcR(tb) and O \u02dcR(tc). In\nthis example \u02dcR = SO2, and the orbits are translations\nto points lying on a circle in the plane.\n\nThe following Corollary is immediate:\n\nCorollary 1. The neural response cannot be scale invariant, even if \u02c6K1 is.\n\nWe give a few examples illustrating the application of the Propositions above.\nExample 1. If we choose the group of rotations of the plane by setting \u02dcR = SO2 \u22b2 O2, then the\norbits O \u02dcR(a) are circles of radius \u2016a\u2016. See Figure 1. Therefore rotation invariance is possible as\nlong as the set \u02dcT (and therefore H, since we can take H = \u03c4\u22121( \u02dcT )) includes translations to all\npoints along the circle of radius \u2016a\u2016, for each element ta \u2208 \u02dcT . In particular if H includes all possible\ntranslations, then Assumption 1 is verified, and we can apply Theorem 1: Nm will be invariant to\nrotations as long as \u02c6K1 is. A similar argument can be made for reflection invariance, as any rotation\ncan be built out of the composition of two reflections.\nExample 2. Analogous to the previous example, we may also consider finite cyclical groups Cn\ndescribing rotations by \u03b8 = 2\u03c0/n. 
In this case the construction of an appropriate set of translations\nis similar: we require that \u02dcT include at least the conjugacy classes with respect to the group Cn,\nCCn(t) for each t \u2208 \u0393 = \u03c4(H).\nExample 3. Consider a simple convolutional neural network [6] consisting of two layers, one filter\nat the first convolution layer, and downsampling at the second layer defined by summation over all\ndistinct k \u00d7 k blocks. In this case, Proposition 2 and Theorem 1 together say that if the filter kernel\nis rotation invariant, then the output representation will be invariant to global rotation of the input\nimage. This is so because convolution implies the choice K1(f, g) = \u27e8f, g\u27e9L2, average pooling,\nand H = H1 containing all possible translations. If the convolution filter z is rotation invariant,\nz \u25e6 r = z for all rotations r, and K1(f \u25e6 r, z) = K1(f, z \u25e6 r\u22121) = K1(f, z). So we can conclude\ninvariance of the initial kernel.\n4.2 Strings, Reflections, and Finite Groups\nWe next consider the case of finite length strings defined on a finite alphabet. One of the advantages\ngroup theory provides in the case of string data is that we need not work with permutation representations.\nIndeed, we may equivalently work with group elements which act on strings as abstract\nobjects. The definition of the neural response given in Smale et al. involves translating an analysis\nwindow over the length of a given string. Clearly translations over a finite string do not constitute a\ngroup as the law of composition is not closed in this case. We will get around this difficulty by first\nconsidering closed words formed by joining the free ends of a string. Following the case of circular\ndata where arbitrary translations are allowed, we will then consider the original setting described in\nSmale et al. 
in which strings are finite non-circular objects.\nTaking a geometric standpoint sheds light on groups of transformations applicable to strings. In\nparticular, one can interpret the operation of the translations in H as a circular shift of a string\nfollowed by truncation outside of a fixed window. The cyclic group of circular shifts of an n-string\nis readily seen to be isomorphic to the group of rotations of an n-sided regular polygon. Similarly,\nreversal of an n-string is isomorphic to reflection of an n-sided polygon, and describes a cyclic group\nof order two. As in Equation (5), we can combine rotation and reflection via a semidirect product\n\nDn \u2245 Cn \u22ca C2    (9)\n\nwhere Ck denotes the cyclic group of order k. The resulting product group has a familiar presentation.\nLet t, r be the generators of the group, with r corresponding to reflection (reversal), and t\ncorresponding to a rotation by angle 2\u03c0/n (leftward circular shift by one character). Then the group\nof symmetries of a closed n-string is described by the relations\n\nDn = \u27e8t, r | t^n, r_v^2, r_v t r_v t\u27e9.    (10)\n\nThese relations can be seen as describing the ways in which an n-string can be left unchanged. The\nfirst says that circularly shifting an n-string n times gives us back the original string. The second says\nthat reflecting twice gives back the original string, and the third says that left-shifting then reflecting\nis the same as reflecting and then right-shifting. In describing exhaustively the symmetries of an\nn-string, we have described exactly the dihedral group Dn of symmetries of an n-sided regular\npolygon. As manipulations of a closed n-string and an n-sided polygon are isomorphic, we will use\ngeometric concepts and terminology to establish invariance of the neural response defined on strings\nwith respect to reversal. 
In the following discussion we will abuse notation and at times denote by u\nand v the largest index associated with the patches u and v.\nIn the case of reflections of strings, r_u is quite distinct from r_v. The latter reflection, r_v, is the\nusual reflection of a v-sided regular polygon, whereas we would like r_u to reflect a smaller u-sided\npolygon. To build a group out of such operations, however, we will need to ensure that r_u and r_v\nboth apply in the context of v-sided polygons. This can be done by extending A_{r_u} to v by defining\nr_u to be the composition of two operations: one which reflects the u portion of a string and leaves\nthe rest fixed, and another which reflects the remaining (v \u2212 u)-substring while leaving the first\nu-substring fixed. In this case, one will notice that r_u can be written in terms of rotations and the\nusual reflection r_v:\n\nr_u = r_v t^{\u2212u} = t^u r_v.    (11)\n\nThis also implies that for any x \u2208 T,\n\n{r x r^{\u22121} | r \u2208 \u27e8r_v\u27e9} = {r x r^{\u22121} | r \u2208 \u27e8r_v, r_u\u27e9},\n\nwhere we have used the fact that T is abelian, and applied the relations in Equation (10). We can\nnow make an educated guess as to the form of T\u0303 by starting with Condition (1) of Definition 3 and\napplying the relations appearing in Equation (10). Given x \u2208 T\u0303, a reasonable requirement is that\nthere must exist an x\u2032 \u2208 T\u0303 such that r_v x = x\u2032 r_u. In this case, since r_u is its own inverse,\n\nx\u2032 = r_v x r_u = r_v x r_v t^{\u2212u} = x^{\u22121} r_v r_v t^{\u2212u} = x^{\u22121} t^{\u2212u},    (12)\n\nwhere the second equality follows from Equation (11), and the remaining equalities follow from\nthe relations (10). The following Proposition confirms that this choice of T\u0303 is compatible with the\nreflection subgroup of G = Dv, and closely parallels Proposition 2.\nProposition 4. Let H be the set of translations associated to an arbitrary layer of the hierarchical\nfeature map and define the injective map \u03c4 : H \u2192 T by ha \u21a6 ta, where a is a parameter\ncharacterizing the translation. Set \u0393 = {\u03c4(h) | h \u2208 H}. Take G = Dn \u2245 T \u22ca R, with T = Cn = \u27e8t\u27e9 and\nR = C2 = {r, 1}. The sets\n\nR\u0303 = R,    T\u0303 = \u0393 \u222a \u0393^{\u22121} t^{\u2212u}\n\nare compatible.\nOne may also consider non-closed strings, as in Smale et al. [11], in which case substrings which would\nwrap around the edges are disallowed. Proposition 4 in fact points to the minimum T\u0303 for reversals in\nthis scenario as well, since the set of allowed translations is the same set as above but with the\nillegal elements removed. If we again take length u substrings of length v strings, this reduced set\nof valid transformations in fact describes the symmetries of a regular (v \u2212 u + 1)-gon. We can thus\napply Proposition 4 working with the dihedral group G = D_{v\u2212u+1} to settle the case of non-closed\nstrings.\n5 Conclusion\nWe have shown that the tools offered by group theory can be profitably applied towards understanding\ninvariance properties of a broad class of deep, hierarchical models. If one knows in advance the\ntransformations to which a model should be invariant, then the translations which must be built into\nthe hierarchy can be described. In the case of images, we showed that the only group to which a\nmodel in the class of interest can be invariant is the group of planar orthogonal operators.\nAcknowledgments\nThis research was supported by DARPA contract FA8650-06-C-7632, Sony, and King Abdullah\nUniversity of Science and Technology.\n\nReferences\n[1] M. Artin. Algebra. Prentice-Hall, 1991.\n[2] J. Bouvrie, L. Rosasco, and T. Poggio. Supplementary material for \u201cOn Invariance in Hierarchical Models\u201d. NIPS, 2009. 
Available online: http://cbcl.mit.edu/publications/ps/978_supplement.pdf.\n[3] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cyb., 36:193\u2013202, 1980.\n[4] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, 2006.\n[5] D.H. Hubel and T.N. Wiesel. Receptive fields and functional architecture of monkey striate cortex. J. Phys., 195:215\u2013243, 1968.\n[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278\u20132324, November 1998.\n[7] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, 2009.\n[8] B.W. Mel. SEEMORE: Combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Comp., 9:777\u2013804, 1997.\n[9] T. Serre, A. Oliva, and T. Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Science, 104:6424\u20136429, 2007.\n[10] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 29:411\u2013426, 2007.\n[11] S. Smale, L. Rosasco, J. Bouvrie, A. Caponnetto, and T. Poggio. Mathematics of the neural response. Foundations of Computational Mathematics, June 2009. Available online, DOI:10.1007/s10208-009-9049-1.\n[12] H. Wersing and E. Korner. Learning optimized features for hierarchical models of invariant object recognition. 
Neural Comput., 15(7):1559\u20131588, July 2003.\n", "award": [], "sourceid": 978, "authors": [{"given_name": "Jake", "family_name": "Bouvrie", "institution": null}, {"given_name": "Lorenzo", "family_name": "Rosasco", "institution": null}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": null}]}