{"title": "Deep Scale-spaces: Equivariance Over Scale", "book": "Advances in Neural Information Processing Systems", "page_first": 7366, "page_last": 7378, "abstract": "We introduce deep scale-spaces, a generalization of convolutional neural networks, exploiting the scale symmetry structure of conventional image recognition tasks. Put plainly, the class of an image is invariant to the scale at which it is viewed. We construct scale equivariant cross-correlations based on a principled extension of convolutions, grounded in the theory of scale-spaces and semigroups. As a very basic operation, these cross-correlations can be used in almost any modern deep learning architecture in a plug-and-play manner. We demonstrate our networks on the Patch Camelyon and Cityscapes datasets, to prove their utility and perform introspective studies to further understand their properties.", "full_text": "Deep Scale-spaces: Equivariance Over Scale\n\nDaniel E. Worrall\u2217\nAMLAB, Philips Lab\n\nUniversity of Amsterdam\nd.e.worrall@uva.nl\n\nAMLAB, Philips Lab\n\nUniversity of Amsterdam\n\nMax Welling\n\nm.welling@uva.nl\n\nAbstract\n\nWe introduce deep scale-spaces (DSS), a generalization of convolutional neural net-\nworks, exploiting the scale symmetry structure of conventional image recognition\ntasks. Put plainly, the class of an image is invariant to the scale at which it is viewed.\nWe construct scale equivariant cross-correlations based on a principled extension\nof convolutions, grounded in the theory of scale-spaces and semigroups. As a very\nbasic operation, these cross-correlations can be used in almost any modern deep\nlearning architecture in a plug-and-play manner. 
We demonstrate our networks\non the Patch Camelyon and Cityscapes datasets, to prove their utility and perform\nintrospective studies to further understand their properties.\n\n1\n\nIntroduction\n\nScale is inherent in the structure of the physical world around us and the measurements we make of it.\nIdeally, the machine learning models we run on this perceptual data should have a notion of scale,\nwhich is either learnt or built directly into them. However, the state-of-the-art models of our time,\nconvolutional neural networks (CNNs) [Lecun et al., 1998], are predominantly local in nature due to\nsmall \ufb01lter sizes. It is not thoroughly understood how they account for and reason about multiscale\ninteractions in their deeper layers, and empirical evidence [Chen et al., 2018, Yu and Koltun, 2015,\nYu et al., 2017] using dilated convolutions suggests that there is still work to be done in this arena.\nIn computer vision, typical methods to circumvent scale are: scale averaging, where multiple scaled\nversions of an image are fed through a network and then averaged [Kokkinos, 2015]; scale selection,\nwhere an object\u2019s scale is found and local computations are adapted accordingly [Girshick et al.,\n2014, Shelhamer et al., 2019]; and scale augmentation, where multiple scaled versions of an image\nare added to the training set [Barnard and Casasent, 1991]. While these methods help, they lack\nexplicit mechanisms to fuse information from different scales into the same representation. Many\nworks do indeed follow this approach Ke et al. [2017], Saxena and Verbeek [2016], Lin et al. [2017],\nKanazawa et al. [2014], Huang et al. [2018] and in this work, we follow this line of thinking and\nconstruct a generalized convolution taking, as input, information from different scales.\nThe utility of convolutions arises in scenarios where there is a translational symmetry (translation\ninvariance) inherent in the task of interest [Cohen and Welling, 2016a]. 
Examples of such tasks are\nobject classi\ufb01cation [Krizhevsky et al., 2012], object detection [Girshick et al., 2014], or dense image\nlabelling [Long et al., 2015]. By using translational weight-sharing [Lecun et al., 1998] for these\ntasks, we reduce the parameter count while preserving symmetry in the deeper layers. The overall\neffect is to improve sample complexity and thus reduce generalization error [Sokolic et al., 2017].\nFurthermore, it has been shown that convolutions (and various reparameterizations of them) are the\nonly linear operators that preserve symmetry [Kondor and Trivedi, 2018]. Attempts have been made\nto extend convolutions to scale, but they either suffer from breaking translation symmetry [Henriques\nand Vedaldi, 2017, Esteves et al., 2017], making the assumption that scalings can be modelled in the\nsame way as rotations [Marcos et al., 2018], or ignoring symmetry constraints [Hilbert et al., 2018].\n\n\u2217deworrall92.github.io\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: How to correctly downsample an image. Left to right: The original high-resolution image;\nan \u00d71/8 subsampled image, notice how a lot of the image structure has been destroyed; a high-\nresolution, bandlimited (blurred) image; a bandlimited and \u00d71/8 subsampled image. Compare the\nbandlimited and subsampled image with the na\u00efvely subsampled image. Much of the low-frequency\nimage structure is preserved in the bandlimited and subsampled image. Image source: ImageNet.\n\nThe problem with the aforementioned approaches is that they fail to account for the unidirectional\nnature of scalings. In data there exist many one-way transformations, which cannot be inverted.\nExamples are occlusions, causal translations, downscalings of discretized images, and pixel lighting\nnormalization. 
In each example the transformation deletes information from the original signal,\nwhich cannot be regained, and thus it is non-invertible. We extend convolutions to these classes of\nsymmetry under noninvertible transformations via the theory of semigroups. Our contributions are\nthe introduction of a semigroup equivariant correlation and a scale-equivariant CNN.\n\n2 Background\n\nThis section introduces some key concepts such as groups, semigroups, actions, equivariance, group\nconvolution, and scale-spaces. These concepts are presented for the bene\ufb01t of the reader, who is not\nexpected to have a deep knowledge of any of these topics a priori.\nDownsizing images We consider sampled images f \u2208 L2(Z2) such as in Figure 1. For image f, x\nis pixel position and f (x) is pixel intensity. If we wish to downsize by a factor of 8, a na\u00efve approach\nwould be to subsample every 8th pixel: fdown(x) = f (8x). This leads to an artifact, aliasing [Mallat,\n2009, p.43], where the subsampled image contains information at a higher-frequency than can be\nrepresented by its resolution. The \ufb01x is to bandlimit pre-subsampling, suppressing high-frequencies\nwith a blur. Thus a better model for downsampling is fdown(x) = [G \u2217Zd f ](8x), where \u2217Zd denotes\nconvolution over Zd, and G is an appropriate blur kernel (discussed later). Downsizing involves\nnecessary information loss and cannot be inverted [Lindeberg, 1997]. Thus upsampling of images\nis not well-de\ufb01ned, since it involves imputing high-frequency information, not present in the low\nresolution image. As such, in this paper we only consider image downscaling.\n\nScale-spaces Scale-spaces have a long history, dating back to the late \ufb01fties with the work of Iijima\n[1959]. They consist of an image f0 \u2208 L2(Rd) and multiple blurred versions of it. Although sampled\nimages live on Zd, scale-space analysis tends to be over Rd, but many of the results we present are\nvalid on both domains. 
Among all variants, the Gaussian scale-space (GSS) is the commonest [Witkin, 1983]. Given an initial image f0, we construct a GSS by convolving f0 with an isotropic (rotation invariant) Gauss-Weierstrass kernel G(x, t) = (4\u03c0t)^{\u2212d/2} exp{\u2212\u2016x\u2016^2/4t} of variable width \u221at and spatial positions x. The GSS is the complete set of responses f (t, x):\n\nf (t, x) = [G(\u00b7, t) \u2217Rd f0](x), t > 0, (1)\nf (0, x) = f0(x), (2)\n\nwhere \u2217Rd denotes convolution over Rd. The higher the level t (larger blur) in the image, the more high frequency details are removed. An example of a scale-space f (t, x) can be seen in Figure 2. An interesting property of scale-spaces is the semigroup property [Florack et al., 1992], sometimes referred to as the recursivity principle [Pauwels et al., 1995], which is\n\nf (s + t, \u00b7) = G(\u00b7, s) \u2217Rd f (t, \u00b7) for s, t > 0. (3)\n\nIt says that we can generate a scale-space from other levels of the scale-space, not just from f0. Furthermore, since s, t > 0 it also says that we cannot generate sharper images from blurry ones, using just a Gaussian convolution. Thus moving to blurrier levels encodes a degree of information loss. This property emanates from the closure of Gaussians under convolution, namely for multidimensional Gaussians with covariance matrices \u03a3 and T\n\nG(\u00b7, \u03a3 + T ) = G(\u00b7, \u03a3) \u2217Rd G(\u00b7, T ). (4)\n\nWe assume the initial image f0 has a maximum spatial frequency content\u2014dictated by pixel pitch in discretized images\u2014which we model by assuming the image has already been convolved with a width s0 Gaussian, which we call the zero-scale. Thus an image of bandlimit s in the scale-space is found at GSS slice f (s \u2212 s0, \u00b7), which we see from Equation 3. 
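As a concrete illustration, a Gaussian scale-space is easy to build directly. The sketch below is a minimal 1-D version in pure Python; the sampled, truncated Gauss-Weierstrass kernel is renormalized to sum to one, and the helper names (gw_kernel, conv1d, scale_space) are ours, not from the paper.

```python
import math

def gw_kernel(t, radius):
    """Sampled Gauss-Weierstrass kernel G(x, t) ~ exp(-x^2 / 4t),
    truncated to [-radius, radius] and renormalized to sum to one."""
    w = [math.exp(-x * x / (4.0 * t)) for x in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

def conv1d(f, k):
    """'Same'-size correlation of f with a symmetric kernel k, zero padding."""
    r = len(k) // 2
    return [sum(k[j] * f[i + j - r]
                for j in range(len(k)) if 0 <= i + j - r < len(f))
            for i in range(len(f))]

def scale_space(f0, ts, radius=8):
    """Gaussian scale-space f(t, x): level 0 is f0 itself (Eq. 2),
    each level t > 0 is a blur of f0 (Eq. 1)."""
    return {0.0: list(f0), **{t: conv1d(f0, gw_kernel(t, radius)) for t in ts}}
```

On an impulse input, larger t spreads and flattens the response, matching the information-loss reading of the semigroup property.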
There are many varieties of scale-spaces: the \u03b1-scale-spaces [Pauwels et al., 1995], the discrete Gaussian scale-spaces [Lindeberg, 1990], the binomial scale-spaces [Burt, 1981], etc. These all have specific kernels, analogous to the Gaussian, which are closed under convolution (details in supplement).\n\nSlices at level t in the GSS correspond to images downsized by a dilation factor 0 \u2264 a \u2264 1 (1 = no downsizing, 0 = shrinkage to a point). An a-dilated and appropriately bandlimited image p(a, x) is found as (details in supplement)\n\np(a, x) = f (t(a, s0), a^{\u22121}x), t(a, s0) := s0/a^2 \u2212 s0. (5)\n\nFigure 2: A scale-space. For implementations we logarithmically discretize the scale-axis in a^{\u22121}.\n\nFor clarity, we refer to decreases in the spatial dimensions of an image as dilation and increases in the blurriness of an image as scaling. For a generalization to anisotropic scaling we replace scalar scale t with matrix T, zero-scale s0 with covariance matrix \u03a30, and dilation parameter a with matrix A, so\n\nT (A, \u03a30) = A^{\u22121}\u03a30A^{\u2212\u22a4} \u2212 \u03a30. (6)\n\nSemigroups The semigroup property of Equation 3 is the gateway between classical scale-spaces and the group convolution [Cohen and Welling, 2016a] (see end of section). Semigroups (S, \u25e6) consist of a non-empty set S and a (binary) composition operator \u25e6 : S \u00d7 S \u2192 S. Typically, the composition s \u25e6 t of elements s, t \u2208 S is abbreviated st. For our purposes, these individual elements will represent dilation parameters. For S to be a semigroup, it must satisfy the following two properties\n\n\u2022 Closure: st \u2208 S for all s, t \u2208 S\n\u2022 Associativity: (st)r = s(tr) = str for all s, t, r \u2208 S\n\nNote that commutativity is not a given, so st \u2260 ts in general. The family of Gaussian densities under spatial convolution is a semigroup2. 
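The closure of Gaussians under convolution (Equation 4) can be checked numerically in 1-D. In the sketch below (our own naming, not from the paper), sampled and renormalized kernels are convolved and compared against the kernel of the summed scales; after sampling and truncation the identity only holds approximately, but the error is tiny at these scales.

```python
import math

def gauss(t, radius):
    """Sampled Gauss-Weierstrass kernel of scale t, renormalized to sum to one."""
    w = [math.exp(-x * x / (4.0 * t)) for x in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]

def full_conv(a, b):
    """Full linear convolution of two 1-D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out

# Closure under convolution (Eq. 4): G(., s) * G(., t) ~ G(., s + t).
s, t, radius = 1.0, 2.0, 20
lhs = full_conv(gauss(s, radius), gauss(t, radius))  # support widens to 2*radius
rhs = gauss(s + t, 2 * radius)                       # kernel of the summed scales
err = max(abs(u - v) for u, v in zip(lhs, rhs))
```

Note the one-way reading: composing two blurs always lands at a larger scale s + t, never a smaller one.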
Semigroups are a generalization of groups, which are used in Cohen and Welling [2016a] to model invertible transformations. For a semigroup to be a group, it must also satisfy the following conditions\n\n\u2022 Identity element: there exists an e \u2208 S such that es = se = s for all s \u2208 S\n\u2022 Inverses: for each s \u2208 S there exists a s\u22121 such that s\u22121s = ss\u22121 = e.\n\nActions Semigroups are useful because we can use them to model transformations, also known as (semigroup) actions. Given a semigroup S, with elements s \u2208 S and a domain X, the action L^X_s : X \u2192 X is a map, also written x \u21a6 L^X_s[x] for x \u2208 X. The defining property of the semigroup action is that it is associative and closed under composition, inheriting its compositional structure from the semigroup S. There are, in fact, two versions of the action, a left and a right action (Equation 7). For the left action L^X_{st}, we first apply L^X_t then L^X_s, but for the right action R^X_{st}, this is reversed.\n\nLeft action: L^X_{st}[x] = L^X_s[L^X_t[x]], Right action: R^X_{st}[x] = R^X_t[R^X_s[x]]. (7)\n\n2For the Gaussians we find the identity element as the limit lim_{t\u21930} G(\u00b7, t), which is a Dirac delta. Note that this element is not strictly in the set of Gaussians, thus the Gaussian family has no identity element.\n\nActions can also be applied to functions f : X \u2192 Y by viewing f as a point in a function space F (X). Note, when the function domain is obvious, we just write F for brevity. The result of the action is a new function, denoted L^F_s[f ]. Since the domain of f is X, we commonly write L^F_s[f ](x). Say the domain is a semigroup S then an example left action is L^F_s[f ](x) = f (R^X_s[x]) = f (xs). This highlights a connection between left actions on functions and right actions on their domains, which we return to later in our discussion on nonlinearities. Another example, which we shall use later, is the action S^{\u03a30}_{A,z} used to form scale-spaces, namely\n\nS^{\u03a30}_{A,z}[f0](x) = [G^{\u03a30}_A \u2217Zd f0](A^{\u22121}x + z), G^{\u03a30}_A := G(\u00b7, A^{\u22121}\u03a30A^{\u2212\u22a4} \u2212 \u03a30). (8)\n\nThe elements of the semigroup are the tuples (A, z) (dilation A, shift z) and G^{\u03a30}_A is an anisotropic discrete Gaussian. The action first bandlimits by G^{\u03a30}_A \u2217Zd f0 then dilates and shifts the domain by A^{\u22121}x + z. Note for fixed (A, z) this maps functions on Zd to functions on Zd.\n\nLifting We can also view actions as maps from functions on X to functions on the semigroup S, so F (X) \u2192 F (S). We call this a lift [Kondor and Trivedi, 2018], denoting lifted functions as f\u2191. One example is the scale-space action (Equation 8). If we set x = 0, then\n\nf\u2191(A, z) = S^{\u03a30}_{A,z}[f0](0) = [G^{\u03a30}_A \u2217Rd f0](z), (9)\n\nwhich is the expression for an anisotropic scale-space (parameterized by the dilation A rather than the scale T (A, \u03a30) = A^{\u22121}\u03a30A^{\u2212\u22a4} \u2212 \u03a30). To lift a function on to a semigroup, we do not necessarily have to set x equal to a constant, but we could also integrate it out. An important property of lifted functions is that actions become simpler. For instance, if we define f\u2191(s) = L^F_s[f ](0) then\n\n(L^F_t[f ])\u2191(s) = L^F_s[L^F_t[f ]](0) = L^F_{st}[f ](0) = f\u2191(st). (10)\n\nThe action on f could be complicated, like a Gaussian blur, but the action on f\u2191 is simply a \u2018shift\u2019 f\u2191(s) \u21a6 f\u2191(st). We can then define the action L^{F(S)}_t on lifted functions as L^{F(S)}_t[f\u2191](s) = f\u2191(st). This action is another example of where the left action on the function is a right action on the domain.\n\nEquivariance and Group Correlations CNNs rely heavily on the (cross-)correlation3. 
Correlations \u22c6Zd are a special class of linear map of the form\n\n[f \u22c6Zd \u03c8](s) = \u2211_{x\u2208Zd} f (x)\u03c8(x \u2212 s). (11)\n\nGiven a signal f and filter \u03c8, we interpret the correlation as the collection of inner products of f with all s-translated versions of \u03c8. This basic correlation has been extended to transformations other than translation via the group correlation, which as presented in Cohen and Welling [2016a] is\n\n[f \u22c6H \u03c8](s) = \u2211_{x\u2208X} f (x)\u03c8(L^X_{s\u22121}[x]), (12)\n\nwhere H is the relevant group and L^X_s is a group action e.g. for 2D rotation L^X_s[x] = Rs x, where Rs is a rotation matrix. Most importantly, the domain of the output is H. It highlights how this is an inner product of f and \u03c8 under all s-transformations of the filter. If we denote4 L^F_s[f ](x) = f (L^X_{s\u22121}[x]), the correlation exhibits a special property. It is equivariant under actions of the group H. In math\n\nL^{F(S)}_s[f \u22c6H \u03c8] = L^{F(X)}_s[f ] \u22c6H \u03c8. (13)\n\nGroup correlation followed by the action is equivalent to the action followed by the group correlation, albeit the action is over a different domain. Note, the group action may \u2018look\u2019 different depending on whether it was applied before or after the group correlation, but it represents the exact same transformation.\n\nNotation As a shorthand, we just write Ls from now on; the domain of the action should be obvious from context.\n\n3In the deep learning literature these are inconveniently referred to as convolutions, but we stick to correlation.\n4Recall how earlier we gave an example of left actions on functions as L^F_s[f ](x) = f (R^X_s[x]) = f (xs); the group action we present here is an example of this, because L^X_{s\u22121}[x] is a right action.\n\n3 Method\n\nWe aim to construct a scale-equivariant convolution. 
We shall achieve this here by introducing an extension of the correlation to semigroups, which we then tailor to scalings.\n\nSemigroup Correlation There are multiple candidates for a semigroup correlation \u22c6S. The basic ingredients of such a correlation will be the inner product, the semigroup action Ls, and the functions \u03c8 \u2208 F and f \u2208 F. Furthermore, it must be equivariant to (left) actions on f. For a semigroup S, domain X, and action Ls : F \u2192 F, we define:\n\n[\u03c8 \u22c6S f ](s) = \u2211_{x\u2208X} \u03c8(x)Ls[f ](x). (14)\n\nIt is the set of responses formed from taking the inner product between a filter \u03c8 and a signal f under all transformations of the signal. Notice that we transform the signal and not the filter and that we write \u03c8 \u22c6S f, not f \u22c6S \u03c8\u2014it turns out that a similar expression where we apply the action to the filter is not equivariant to actions on the signal. Furthermore this expression lifts a function from X to S, so we expect actions on f to look like a \u2018shift\u2019 on the semigroup. A proof of equivariance to (left) actions on f is as follows\n\n[\u03c8 \u22c6 Lt[f ]](s) = \u2211_{x\u2208X} \u03c8(x)Ls[Lt[f ]](x) = \u2211_{x\u2208X} \u03c8(x)Lst[f ](x) = [\u03c8 \u22c6 f ](st) = Lt[\u03c8 \u22c6 f ](s). (15)\n\nWe have used the definition of the left action LsLt = Lst, the semigroup correlation, and our definition of the action for lifted functions. We can recover the standard \u2018convolution\u2019, by substituting S = Zd, X = Zd, and the translation action Ls[f ](x) = f (x + s):\n\n[\u03c8 \u22c6Zd f ](s) = \u2211_{x\u2208Zd} \u03c8(x)f (x + s) = \u2211_{x\u2032\u2208Zd} \u03c8(x\u2032 \u2212 s)f (x\u2032), (16)\n\nwhere x\u2032 = x + s. 
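The definition in Equation 14 and the equivariance proof of Equation 15 can be checked mechanically. The sketch below (function names are ours) uses cyclic translations on a length-n signal so that the finite domain stays closed under the action; the final assertion is exactly Equation 15: correlating the shifted signal equals shifting the output along the semigroup.

```python
def semigroup_corr(psi, f, action, elements):
    """Semigroup correlation (Eq. 14): [psi * f](s) = sum_x psi(x) Ls[f](x).
    Note: the *signal* is transformed, not the filter."""
    return {s: sum(p * v for p, v in zip(psi, action(s, f))) for s in elements}

def shift(s, f):
    """Translation action Ls[f](x) = f(x + s), taken cyclically so the
    finite domain stays closed under composition."""
    n = len(f)
    return [f[(x + s) % n] for x in range(n)]

n = 8
f = [0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 1.0, 0.0]
psi = [1.0, -1.0] + [0.0] * (n - 2)          # a small difference filter

out = semigroup_corr(psi, f, shift, range(n))

# Equivariance (Eq. 15): correlate the t-shifted signal, compare with
# the 'shifted' output s -> st (here st = s + t mod n).
t = 3
out_shifted = semigroup_corr(psi, shift(t, f), shift, range(n))
assert all(abs(out_shifted[s] - out[(s + t) % n]) < 1e-12 for s in range(n))
```

With this filter, out[s] = f(s) - f(s+1), i.e. a finite difference at every translation of the signal.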
We can also recover the group correlation by setting S = H, X = H, where H is a discrete group, and Ls[f ](x) = f (Rsx), where Rs is a right action acting on the domain X = H:\n\n[\u03c8 \u22c6H f ](s) = \u2211_{x\u2208H} \u03c8(x)f (Rs[x]) = \u2211_{x\u2032\u2208H} \u03c8(R_{s\u22121}[x\u2032])f (x\u2032), (17)\n\nwhere x\u2032 = Rs[x] and, since Rs is a group action, inverses exist, so x = R_{s\u22121}[x\u2032]. The semigroup correlation has two notable differences from the group correlation: i) In the semigroup correlation we transform the signal and not the filter. When we restrict to the group and standard convolution, transforming the signal or the filter are equivalent operations since we can apply a change of variables. This is not possible in the semigroup case, since this change of variables requires an inverse, which we do not necessarily have. ii) In the semigroup correlation, we apply an action to the whole signal as Ls[f ], as opposed to just the domain (f (Rs[x])). This allows for more general transformations than allowed by the group correlation of Cohen and Welling [2016a], since transformations of the form f (Rs[x]) can only move pixel locations, but transformations of the form Ls[f ] can alter the values of the pixels as well, and can incorporate neighbourhood information into the transformation.\n\nThe Scale-space Correlation We now have the tools to create a scale-equivariant correlation. All we have to choose is an appropriate action for Ls. We choose the scale-space action of Equation 8. The scale-space action S^{\u03a30}_{A,z} for functions on Zd is given by\n\nS^{\u03a30}_{A,z}[f ](x) = [G^{\u03a30}_A \u2217Zd f ](A^{\u22121}x + z), S^{\u03a30}_{A,z}S^{\u03a30}_{B,y} = S^{\u03a30}_{AB,Ay+z}. (18)\n\nSince our scale-space correlation only works for discrete semigroups, we have to find a suitable discretization of the dilation parameter A. 
Later on we will choose a discretization of the form Ak = 2^{\u2212k}I for k \u2265 0, but for now we will just assume that there exists some countable set A, such that S = {(A, z) : A \u2208 A, z \u2208 Zd} is a valid discrete semigroup. We begin by assuming we have lifted an input image f on to the scale-space via Equation 9. The lifted signal is indexed by coordinates A, z and so the filters share this domain and are of the form \u03c8(A, z). The scale-space correlation is then\n\n[\u03c8 \u22c6S f ](A, z) = \u2211_{(B,y)\u2208S} \u03c8(B, y)S^{\u03a30}_{B,y}[f ](A, y) = \u2211_{B\u2208A} \u2211_{y\u2208Zd} \u03c8(B, y)f (BA, A^{\u22121}y + z). (19)\n\nFigure 3: Scale correlation schematic: The left 3 stacks are the same input f, with levels f_\u2113(\u00b7) = f (2^{\u2212\u2113}I, \u00b7). Each stack shows the inner product between filter \u03c8 (in green) at translation z for dilation 2^{\u2212k}I corresponding to the output level [\u03c8 \u22c6S f ]_k on the right with matching color. Notice that as we dilate the filter, we also shift it one level up in the scale-space, according to Equation 20.\n\nFor the second equality we recall the action on a lifted signal is governed by Equation 10. The appealing aspect of this correlation is that we do not need to convolve with a bandlimiting filter\u2014a potentially expensive operation to perform at every layer of a CNN\u2014since we use signals that have been lifted on to the semigroup. Instead, the action of scaling by A is accomplished by \u2018fetching\u2019 a slice f (BA, \u00b7) from a blurrier level of the scale-space. Let\u2019s restrict the scale correlation to the scale-space where Ak = 2^{\u2212k}I for k \u2265 0, with zero-scale \u03a30 = (1/4)I. 
Denoting f (2^{\u2212k}I, \u00b7) as f_k(\u00b7), this can be seen as a dilated convolution [Yu and Koltun, 2015] between \u03c8_\u2113 and slice f_{\u2113+k}. This form of the scale-space correlation (shown below) we use in our experiments. A diagram of this can be seen in Figure 3:\n\n[\u03c8 \u22c6S f ]_k(z) = \u2211_{\u2113\u22650} \u2211_{y\u2208Zd} \u03c8_\u2113(y)f_{\u2113+k}(2^k y + z). (20)\n\nEquivariant Nonlinearities Not all nonlinearities commute with semigroup actions, but it turns out that pointwise nonlinearities \u03bd commute with a special subset of actions of the form\n\nLs[f ](x) = f (Rs[x]), (21)\n\nwhere Rs is a right action on the domain of f. For these sorts of actions, we cannot alter the values of f, just the locations of the values. If we write function composition as [\u03bd \u2022 f ](x) = \u03bd(f (x)), then a proof of equivariance is as follows:\n\n[\u03bd \u2022 Ls[f ]](x) = \u03bd(f (Rs[x])) = [\u03bd \u2022 f ](Rs[x]) = Ls[\u03bd \u2022 f ](x). (22)\n\nEquation 21 may at first glance seem overly restrictive, but it turns out that this is not the case. Recall that for functions lifted on to the semigroup, the action is Ls[f ](x) = f (xs). This satisfies Equation 21, and so we are free to use pointwise nonlinearities.\n\nBatch normalization For batch normalization, we compute batch statistics over all dimensions of an activation tensor except its channels, as in Cohen and Welling [2016a].\n\nInitialization Since our correlations are based on dilated convolutions, we use the initialization scheme presented in [Yu and Koltun, 2015]. For pairs of input and output channels the center pixel of each filter is set to one and the rest are filled with random noise of standard deviation 10^{\u22122}.\n\nBoundary conditions In our semigroup correlation, the scale dimension is infinite\u2014this is a problem for practical implementations. 
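Restricted to dyadic dilations, Equation 20 is just a sum of dilated cross-correlations over slices of a (necessarily truncated) scale-space stack. The sketch below is a minimal 1-D version in pure Python with zero padding; the filter support is taken as y >= 0 for simplicity and the names are ours.

```python
def scale_corr(psi, fs, num_out_levels):
    """Scale-space correlation, 1-D (Eq. 20):
        [psi * f]_k(z) = sum_{l>=0} sum_y psi_l(y) f_{l+k}(2**k * y + z).
    fs[l] is the scale-space slice at dilation 2**-l; levels past the
    truncated stack and positions outside a slice read as zero."""
    n = len(fs[0])
    out = []
    for k in range(num_out_levels):
        row = []
        for z in range(n):
            acc = 0.0
            for l, taps in enumerate(psi):       # filter scale-dimension l
                if l + k >= len(fs):
                    continue                     # truncated scale axis
                for y, w in enumerate(taps):     # filter spatial taps, y >= 0
                    x = (2 ** k) * y + z         # dilated spatial lookup
                    if 0 <= x < n:
                        acc += w * fs[l + k][x]
            row.append(acc)
        out.append(row)
    return out
```

Note how output level k reads slice l + k: dilating the filter also shifts it up the scale-space, as in Figure 3.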
Our solution is to truncate the scale-space to finite scale. This breaks global equivariance to actions in S, but is locally correct. Boundary effects occur at activations with receptive fields covering the truncation boundary. To mitigate these effects we: i) use filters with a scale-dimension no larger than two, ii) interleave filters with scale-dimension 2 with filters of scale-dimension 1. The scale-dimension 2 filters enable multiscale interactions but they propagate boundary effects; whereas the scale-dimension 1 kernels have no boundary overlap, but also no multiscale behavior. Interleaving trades off network expressiveness against boundary effects.\n\nTable 1: Results on the Patch Camelyon and Cityscapes datasets. Higher is better. Our scale-equivariant model outperforms the matched baselines. We must caution that better competing results can be found in the literature, when the computational constraint is relaxed. For instance, Shelhamer et al. [2019] report a mAP of 71.4 on a ResNet-34, which is deeper than our model by 15 layers.\n\nPCam Model / Accuracy: DenseNet Baseline, 87.0; S-DenseNet (Ours), 88.1; [Veeling et al., 2018], 89.8.\nCityscapes Model / mAP: ResNet, matched parameters, 45.66; ResNet, matched channels, 49.99; S-ResNet, multiscale (Ours), 63.53; S-ResNet, no interaction (Ours), 64.78.\n\nScale-space implementation We use a 4 layer scale-space and zero-scale 1/4 with dilations at integer powers of 2; the maximum dilation is 8 and kernel width 33 (4 std. of a discrete Gaussian). We use the discrete Gaussian of Lindeberg [1990]. In 1D for scale parameter t, this is\n\nG(x, t) = e^{\u2212t} I_{|x|}(t), (23)\n\nwhere I_x(t) is the modified Bessel function of integer order. For speed, we make use of the separability of isotropic kernels. For instance, convolution with a 2D Gaussian can be written as a convolution with 2 identical 1D Gaussians sequentially along the x and then the y-axis. 
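The discrete Gaussian of Equation 23 and the separability trick can both be sketched directly. Below, I_n(t) is evaluated by its power series in pure Python (function names are ours), and the 2-D blur is implemented as two 1-D passes, one along each axis.

```python
import math

def bessel_i(n, t, terms=60):
    """Modified Bessel function I_n(t), integer order n >= 0, by power series."""
    return sum((t / 2.0) ** (2 * m + n) / (math.factorial(m) * math.factorial(m + n))
               for m in range(terms))

def discrete_gauss(t, radius):
    """Discrete Gaussian kernel G(x, t) = exp(-t) I_|x|(t) (Eq. 23),
    truncated to |x| <= radius; its total mass is 1 up to the truncated tail."""
    return [math.exp(-t) * bessel_i(abs(x), t) for x in range(-radius, radius + 1)]

def blur_rows(img, k):
    """Correlate every row with the 1-D kernel k ('same' size, zero padding)."""
    r = len(k) // 2
    return [[sum(k[j] * row[i + j - r]
                 for j in range(len(k)) if 0 <= i + j - r < len(row))
             for i in range(len(row))] for row in img]

def separable_blur(img, k):
    """2-D isotropic blur as two 1-D passes (rows, then columns via transposes)."""
    rows_done = blur_rows(img, k)
    cols = [list(c) for c in zip(*rows_done)]        # transpose
    cols_done = blur_rows(cols, k)
    return [list(r) for r in zip(*cols_done)]        # transpose back
```

On a centered impulse, the separable blur reproduces the outer product of the 1-D kernel with itself, which is exactly the isotropic 2-D kernel.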
For an N \u00d7 N input image and M \u00d7 M blurring kernel, this reduces the computational complexity of the convolution from O(M^2 N^2) to O(2M N^2). With GPU parallelization, this saving is O(M^2) \u2192 O(M), which is especially significant for us since the largest blur kernel we use has M = 33.\n\nMulti-channel features Typically CNNs use multiple channels per activation tensor, which we have not included in our above treatment. In our experiments we include input channels i and output channels o, so a correlation layer is\n\n[\u03c8 \u22c6S f ]^o_k(z) = \u2211_i \u2211_{\u2113\u22650} \u2211_{x\u2208Zd} \u03c8^{i,o}_\u2113(x)f^i_{\u2113+k}(2^k x + z). (24)\n\n4 Experiments and Results\n\nHere we present results for some preliminary simple experiments on the Patch Camelyon [Veeling et al., 2018] and Cityscapes [Cordts et al., 2016] datasets. Due to the extra computational overhead of computing the scale-equivariant cross-correlation, the experiments are more an indicator of what our method is capable of; please note that the baselines are non-state-of-the-art reimplementations, restricted in size for fair comparison, rather than attempts to beat existing benchmarks. We also visualize the quality of scale-equivariance achieved.\n\nPatch Camelyon The Patch Camelyon or PCam dataset [Veeling et al., 2018] contains 327 680 tiles from two classes, metastatic (tumorous) and non-metastatic tissue. Each tile is a 96 \u00d7 96 px RGB-crop labelled as metastatic if there is at least one pixel of metastatic tissue in the central 32 \u00d7 32 px region of the tile. We test a 4-scale DenseNet model [Huang et al., 2017], \u2018S-DenseNet\u2019, on this task (architecture in supplement). We also train a scale non-equivariant DenseNet baseline and the rotation equivariant model of Veeling et al. [2018]. 
Our training procedure is: 100 epochs SGD, learning rate 0.1 divided by 10 every 40 epochs, momentum 0.9, batch size of 512, split over 4 GPUs. For data augmentation, we follow the procedure of Veeling et al. [2018], Liu et al. [2017], using random flips, rotation, and 8 px jitter. For color perturbations we use: brightness delta 64/255, saturation delta 0.25, hue delta 0.04, contrast delta 0.75. The evaluation metric we test on is accuracy. The results in Table 1 show that both the scale and rotation equivariant models outperform the computation-matched baseline.\n\nCityscapes The Cityscapes dataset [Cordts et al., 2016] contains 2975 training images, 500 validation images, and 1525 test images of resolution 2048 \u00d7 1024 px. The task is semantic segmentation into 19 classes. We train a 4-scale ResNet [He et al., 2016], \u2018S-ResNet\u2019, and baseline. We train an equivariant network with and without multiscale interaction layers. We also train two scale non-equivariant models, one with the same number of channels, one with the same number of parameters. Our training procedure is: 100 epochs Adam, learning rate 10^{\u22123} divided by 10 every 40 epochs, batch size 8, split over 4 GPUs. The results are in Table 1. The evaluation metric is mean average precision. We see that our scale-equivariant model outperforms the baselines. We must caution, however, that better competing results can be found in the literature. For instance, Shelhamer et al. [2019] report a mAP of 71.4 on a ResNet-34, which is deeper than our model by 15 layers. That said, they train with continuous scale augmentation, which we do not. The reason our baseline underperforms compared to the literature is the parameter/channel-matching, which has shrunk its size somewhat due to our own resource constraints. 
On a like-for-like comparison, scale-equivariance appears to help.\n\nQuality of Equivariance We validate the quality of equivariance empirically by comparing activations of a dilated image against the theoretical action on the activations. Using \u03a6 to denote the deep scale-space (DSS) mapping, we compute the normalized L2-distance at each level k of a DSS. Mathematically this is\n\nL(2^{\u2212\u2113}, k) = \u2016\u03a6[f ](k + \u2113, 2^\u2113 \u00b7) \u2212 \u03a6[S^{(1/4)I}_{2^{\u2212\u2113},0}[f ]](k, \u00b7)\u2016_2 / \u2016\u03a6[f ](k + \u2113, 2^\u2113 \u00b7)\u2016_2. (25)\n\nThe equivariance errors are in Figure 4 for 3 DSSs with random weights and a scale-space truncated to 8 scales. We see that the average error is below 0.01, indicating that the network is mostly equivariant, with errors due to truncation of the discrete Gaussian kernels used to lift the input to scale-space. We also see that the equivariance errors blow up for constant \u2113 + k in each graph. This is the point where the receptive field of an activation overlaps with the scale-space truncation boundary.\n\n5 Related Work\n\nIn recent years, there have been a number of works on group convolutions, namely continuous roto-translation in 2D [Worrall et al., 2017] and 3D [Weiler et al., 2018a, Kondor et al., 2018, Thomas et al., 2018] and discrete roto-translations in 2D [Cohen and Welling, 2016a, Weiler et al., 2018b, Bekkers et al., 2018, Hoogeboom et al., 2018] and 3D [Worrall et al., 2017], continuous rotations on the sphere [Esteves et al., 2018, Cohen et al., 2018b], in-plane reflections [Cohen and Welling, 2016a], and even reverse-conjugate symmetry in DNA sequencing [Lunter and Brown, 2018]. 
Theory for convolutions on compact groups, used to model invertible transformations, also exists [Cohen and Welling, 2016a,b, Kondor and Trivedi, 2018, Cohen et al., 2018a], but to date the vast majority of work has focused on rotations.

For scale, there are far fewer works with explicit scale equivariance. Henriques and Vedaldi [2017] and Esteves et al. [2017] both perform a log-polar transform of the signal before passing it to a standard CNN. Log-polar transforms reparameterize the input plane into angle and log-distance from a predefined origin. The transform is sensitive to origin positioning, which, if done poorly, breaks translational equivariance. Marcos et al. [2018] use a group CNN architecture designed for roto-translation, but instead of rotating filters in the group correlation, they scale them. This seems to work on small tasks, but ignores large scale variations. Kanazawa et al. [2014] convolve the same filter over rescaled versions of the same feature map and then max-pool over feature location, for local scale-invariance. Hilbert et al. [2018] instead use filters of different sizes, but without any attention to equivariance. Ke et al. [2017] introduce a multigrid convolution, where the convolution outputs multiple feature maps at different resolutions. The input to each resolution's convolution is a concatenation of rescaled feature maps from the previous layer. This is similar to our work, but differs in two ways: 1) there is no across-scale weight-tying (so no explicit equivariance), and 2)

Figure 4: Equivariance quality. Left to right: 1, 2, and 3 layer DSSs. Each line represents the error as in Equation 25.
We see the residual error is typically < 0.01 until boundary effects are present.

they maintain feature maps at different resolutions rather than different scalings (bandlimits). Two other works with similar approaches are Huang et al. [2018] and Saxena and Verbeek [2016], but their goal is architecture search rather than scale equivariance. Feature Pyramid Networks [Lin et al., 2017] also use multiscale features, with a scheme similar to a UNet [Ronneberger et al., 2015], but where predictions are made using every resolution of decoded feature maps. An interesting work with a very different approach is Shelhamer et al. [2019], where filters are formed as the convolution of a base filter and an anisotropic Gaussian filter, whose covariance is predicted at test time. This produces scale-adaptive filters.

6 Discussion, Limitations, and Future Works

We found our best performing architectures were composed mainly of correlations where the filters' scale dimension is one, interleaved with correlations where the scale dimension is higher. This is similar to a network-in-network [Lin et al., 2013] architecture, where 3 × 3 convolutional layers are interleaved with 1 × 1 convolutions. We posit this is because of boundary effects, as observed in Figure 4. Beyond boundary effects, we also suspect that using non-integer dilations with finer increments in scale would greatly improve performance, as is often witnessed in the scale-space literature. This would, however, require the development of non-integer dilations and hence interpolation.
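Such a non-integer dilation could, for instance, resample a filter by linear interpolation between its taps. This is only a sketch of the idea under our own naming, not the operator developed in this paper:

```python
import math

def dilate_filter(w, s):
    """Resample a 1-D filter w at a (possibly non-integer) dilation
    factor s >= 1, using linear interpolation between taps."""
    n = len(w)
    m = math.ceil((n - 1) * s) + 1   # number of output taps
    out = []
    for i in range(m):
        x = i / s                    # sample position in the original filter
        j = min(int(x), n - 2)       # left neighbouring tap
        t = x - j                    # linear interpolation weight
        out.append((1 - t) * w[j] + t * w[j + 1])
    return out

# Dilating a tent filter by 2 gives the familiar interpolated tent;
# a factor of 1.5 lands samples between the original taps.
tent2 = dilate_filter([0.0, 1.0, 0.0], 2.0)   # [0.0, 0.5, 1.0, 0.5, 0.0]
tent15 = dilate_filter([0.0, 1.0, 0.0], 1.5)
```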
We see mitigating boundary effects and extending the semigroup correlation to non-integer scalings as important future work, not just for scale equivariance, but for CNNs as a whole.

Another limitation of the current model is the increase in computational overhead, since we have added an extra dimension to the activations. This may not be a problem in the long term, as GPUs grow in speed and memory, but the computational complexity of a correlation grows exponentially in the number of symmetries of the model, and so we need more efficient methods to perform correlation, either exactly or approximately. In terms of experimentation, this extra computation limited our ability to compare against large state-of-the-art models, perhaps giving our model an unfair advantage in this limited scheme.

In terms of the experiments, a much more in-depth exploration of the empirical properties of the scale-equivariant correlation is needed. Our light proof-of-concept experiments provide some evidence that built-in scale equivariance can help, but before this is going to be useful in practice, we need to solve the issues of efficiency, non-integer dilations, and the boundary effects due to scale-space truncation. We see semigroup correlations as an exciting new family of operators to use in deep learning. We have demonstrated a proof-of-concept on scale, but there are many semigroup-structured transformations left to be explored, such as causal shifts, occlusions, and affine transformations. Concerning scale, we are also keen to explore how multi-scale interactions can be applied to other domains, such as meshes and graphs, where symmetry is less well-defined.

7 Conclusion

We have presented deep scale-spaces, a generalization of convolutional neural networks, exploiting the scale symmetry structure of conventional image recognition tasks.
We outlined new theory for a generalization of the convolution operator on semigroups, the semigroup correlation. Then we showed how to derive the standard convolution and the group convolution [Cohen and Welling, 2016a] from the semigroup correlation. We then tailored the semigroup correlation to the scale-translation action used in classical scale-space theory and demonstrated how to use this in modern neural architectures.

Acknowledgements

We thank Koninklijke Philips N.V. for in-cash and in-kind support of this research. We also thank Rianne van den Berg, Patrick Forré, and the anonymous reviewers, who all made important contributions to this paper.

References

E. Barnard and D. Casasent. Invariance and neural nets. IEEE Transactions on Neural Networks, 2(5):498–508, Sep. 1991. ISSN 1045-9227. doi: 10.1109/72.134287.

Erik J. Bekkers, Maxime W. Lafarge, Mitko Veta, Koen A. J. Eppenhof, Josien P. W. Pluim, and Remco Duits. Roto-translation covariant convolutional networks for medical image analysis. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2018 - 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part I, pages 440–448, 2018. doi: 10.1007/978-3-030-00928-1_50.

Peter J Burt. Fast filter transform for image processing. Computer Graphics and Image Processing, 16(1):20–51, 1981.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018. doi: 10.1109/TPAMI.2017.2699184.

Taco Cohen and Max Welling. Group equivariant convolutional networks.
In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2990–2999, 2016a.

Taco Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant cnns on homogeneous spaces. CoRR, abs/1811.02017, 2018a.

Taco S. Cohen and Max Welling. Steerable cnns. CoRR, abs/1612.08498, 2016b.

Taco S. Cohen, Mario Geiger, Jonas Köhler, and Max Welling. Spherical cnns. CoRR, abs/1801.10130, 2018b.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3213–3223, 2016. doi: 10.1109/CVPR.2016.350.

Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar transformer networks. CoRR, abs/1709.01889, 2017.

Carlos Esteves, Christine Allen-Blanchette, Ameesh Makadia, and Kostas Daniilidis. Learning SO(3) equivariant representations with spherical cnns. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pages 54–70, 2018. doi: 10.1007/978-3-030-01261-8_4.

Luc Florack, Bart M. ter Haar Romeny, Jan J. Koenderink, and Max A. Viergever. Scale and the differential structure of images. Image Vision Comput., 10(6):376–388, 1992. doi: 10.1016/0262-8856(92)90024-W.

Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 580–587, 2014. doi: 10.1109/CVPR.2014.81.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016. doi: 10.1109/CVPR.2016.90.

João F. Henriques and Andrea Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1461–1469, 2017.

Adam Hilbert, Bastiaan Veeling, and Henk Marquering. Data-efficient convolutional neural networks for treatment decision support in acute ischemic stroke. International Conference on Medical Imaging with Deep Learning, 2018.

Emiel Hoogeboom, Jorn W. T. Peters, Taco S. Cohen, and Max Welling. Hexaconv. CoRR, abs/1803.02108, 2018.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261–2269, 2017. doi: 10.1109/CVPR.2017.243.

Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net/forum?id=Hk2aImxAb.

Taizo Iijima. Basic theory of pattern observation. Technical Group on Automata and Automatic Control, pages 3–32, 1959.

Angjoo Kanazawa, Abhishek Sharma, and David W. Jacobs. Locally scale-invariant convolutional neural networks. CoRR, abs/1412.5104, 2014. URL http://arxiv.org/abs/1412.5104.

Tsung-Wei Ke, Michael Maire, and Stella X. Yu.
Multigrid neural architectures. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4067–4075, 2017. doi: 10.1109/CVPR.2017.433. URL http://doi.ieeecomputersociety.org/10.1109/CVPR.2017.433.

Iasonas Kokkinos. Pushing the boundaries of boundary detection using deep learning. International Conference on Learning Representations (ICLR), 2015.

Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 2752–2760, 2018.

Risi Kondor, Zhen Lin, and Shubhendu Trivedi. Clebsch-gordan nets: a fully fourier space spherical convolutional neural network. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 10138–10147, 2018.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 1106–1114, 2012.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.

Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 936–944, 2017. doi: 10.1109/CVPR.2017.106. URL https://doi.org/10.1109/CVPR.2017.106.

Tony Lindeberg. Scale-space for discrete signals. IEEE Trans. Pattern Anal. Mach. Intell., 12(3):234–254, 1990. doi: 10.1109/34.49051.

Tony Lindeberg. On the axiomatic foundations of linear scale-space. In Gaussian Scale-Space Theory, pages 75–97, 1997. doi: 10.1007/978-94-015-8802-7_6.

Yun Liu, Krishna Gadepalli, Mohammad Norouzi, George E. Dahl, Timo Kohlberger, Aleksey Boyko, Subhashini Venugopalan, Aleksei Timofeev, Philip Q. Nelson, Gregory S. Corrado, Jason D. Hipp, Lily Peng, and Martin C. Stumpe. Detecting cancer metastases on gigapixel pathology images. CoRR, abs/1703.02442, 2017.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3431–3440, 2015. doi: 10.1109/CVPR.2015.7298965.

Gerton Lunter and Richard Brown. An equivariant bayesian convolutional network predicts recombination hotspots and accurately resolves binding motifs. bioRxiv, 2018. doi: 10.1101/351254.

Stéphane Mallat. A Wavelet Tour of Signal Processing - The Sparse Way, 3rd Edition. Academic Press, 2009. ISBN 978-0-12-374370-1.

Diego Marcos, Benjamin Kellenberger, Sylvain Lobry, and Devis Tuia. Scale equivariance in cnns with vector fields. CoRR, abs/1807.11783, 2018.

Eric J. Pauwels, Luc J. Van Gool, Peter Fiddelaers, and Theo Moons. An extended class of scale-invariant and recursive scale space filters. IEEE Trans. Pattern Anal. Mach. Intell., 17(7):691–701, 1995. doi: 10.1109/34.391411.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation.
In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234–241, 2015. doi: 10.1007/978-3-319-24574-4_28. URL https://doi.org/10.1007/978-3-319-24574-4_28.

Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4053–4061, 2016. URL http://papers.nips.cc/paper/6304-convolutional-neural-fabrics.

Evan Shelhamer, Dequan Wang, and Trevor Darrell. Blurring the line between structure and learning to optimize and adapt receptive fields. CoRR, abs/1904.11487, 2019. URL http://arxiv.org/abs/1904.11487.

Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Generalization error of invariant classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pages 1094–1103, 2017.

Nathaniel Thomas, Tess Smidt, Steven M. Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3d point clouds. CoRR, abs/1802.08219, 2018.

Bastiaan S. Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant cnns for digital pathology. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2018 - 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II, pages 210–218, 2018. doi: 10.1007/978-3-030-00934-2_24.

Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco Cohen. 3d steerable cnns: Learning rotationally equivariant features in volumetric data.
In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 10402–10413, 2018a.

Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant cnns. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 849–858, 2018b.

Andrew P. Witkin. Scale-space filtering. In Proceedings of the 8th International Joint Conference on Artificial Intelligence, Karlsruhe, FRG, August 1983, pages 1019–1022, 1983.

Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 7168–7177, 2017. doi: 10.1109/CVPR.2017.758.

Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.

Fisher Yu, Vladlen Koltun, and Thomas A. Funkhouser. Dilated residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 636–644, 2017. doi: 10.1109/CVPR.2017.75.