{"title": "Deep Neural Networks with Box Convolutions", "book": "Advances in Neural Information Processing Systems", "page_first": 6211, "page_last": 6221, "abstract": "Box filters computed using integral images have been part of the computer vision toolset for a long time. Here, we show that a convolutional layer that computes box filter responses in a sliding manner can be used within deep architectures, whereas the dimensions and the offsets of the sliding boxes in such a layer can be learned as part of an end-to-end loss minimization. Crucially, the training process can make the size of the boxes in such a layer arbitrarily large without incurring extra computational cost and without the need to increase the number of learnable parameters. Due to its ability to integrate information over large boxes, the new layer facilitates long-range propagation of information and leads to the efficient increase of the receptive fields of downstream units in the network. By incorporating the new layer into existing architectures for semantic segmentation, we are able to achieve both the increase in segmentation accuracy as well as the decrease in the computational cost and the number of learnable parameters.", "full_text": "Deep Neural Networks with Box Convolutions\n\nEgor Burkov 1,2\n\nVictor Lempitsky 1,2\n\n2 Skolkovo Institute of Science and Technology (Skoltech)\n\n1 Samsung AI Center\n\nMoscow, Russia\n\nAbstract\n\nBox \ufb01lters computed using integral images have been part of the computer vision\ntoolset for a long time. Here, we show that a convolutional layer that computes\nbox \ufb01lter responses in a sliding manner can be used within deep architectures,\nwhereas the dimensions and the offsets of the sliding boxes in such a layer can be\nlearned as a part of an end-to-end loss minimization. 
Crucially, the training process can make the size of the boxes in such a layer arbitrarily large without incurring extra computational cost and without the need to increase the number of learnable parameters. Due to its ability to integrate information over large boxes, the new layer facilitates long-range propagation of information and leads to an efficient increase of the receptive fields of network units. By incorporating the new layer into existing architectures for semantic segmentation, we are able to achieve both an increase in segmentation accuracy and a decrease in the computational cost and the number of learnable parameters.

1 Introduction

High-accuracy visual recognition requires integrating information from spatially-distant locations in the visual field in order to discern and to analyze long-range correlations and regularities. Achieving such long-range integration inside convolutional networks (ConvNets), which lack feedback connections and rely on feedforward mechanisms, is challenging. Modern ConvNets therefore combine several ideas that facilitate spatial long-range integration and boost the effective size of the receptive fields of the convolutional units. These ideas include stacking a very large number of layers (so that the local information integration effects of individual convolutional layers are accumulated) as well as using spatial downsampling of the representations (implemented using pooling or strides) early in the pipeline.

One particularly useful and natural idea is the use of spatially-large filters inside convolutional layers. Generally, a naive increase of the spatial filter size leads to a quadratic increase in the number of learnable parameters and numeric operations, leading to architectures that are both slow and prone to overfitting.
As a result, most current ConvNets rely on small 3 × 3 filters (or even smaller ones) for most convolutional layers [25]. A highly popular alternative to the naive enlargement of filters is dilated/"à trous" convolutions [13, 3, 33], which expand filter sizes without increasing the number of parameters by padding the filters with zeros. Many popular architectures, especially semantic segmentation architectures that emphasize efficiency, rely on dilated convolutions extensively (e.g. [3, 33, 34, 20, 21]).

Here, we present and evaluate a new, simple approach for inserting convolutions with large (potentially very large) spatial filters inside ConvNets. The approach is based on box filtering and relies on integral images [18], as many classical works in computer vision do [31]. The new convolutional box layer applies box average filters in a convolutional manner by sliding 2D axis-aligned boxes across spatial locations, while performing box averaging at every location. The dimensions and the offsets of the boxes w.r.t. the sliding coordinate frame are treated as learnable parameters of this layer. The new layer therefore combines the following merits: (i) large-size convolutional filtering, (ii) a low number of learnable parameters, (iii) computational efficiency achieved via integral images, (iv) effective integration of spatial information over large spatial extents.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We evaluate the new layer by embedding it into a block that includes the new layer, as well as a residual connection and a 1 × 1 convolution. We then consider semantic segmentation architectures that have been designed for optimal accuracy-efficiency trade-offs (ENet [20] and ERFNet [21]), and replace analogous blocks based on dilated convolutions inside those architectures with the new block.
We show that such a replacement allows us both to increase the accuracy of the network and to decrease the number of operations and the number of learnable parameters inside the network. We conclude that the new layer (as well as the proposed embedding block) is a viable solution for achieving efficient spatial propagation and integration of information in ConvNets, as well as for designing efficient and accurate ConvNets.

2 Related work

Spatial propagation in ConvNets. In order to increase the receptive fields of convolutional neurons, and to effectively propagate/mix information across spatial locations, several high-level ideas can be implemented in ConvNets (potentially in combination). First, a ConvNet can be made very deep, so that the limited spatial propagation of individual layers is accumulated. E.g., most top-performing ConvNets have several dozens of convolutional layers [29, 12, 30]. Such "extreme" depth, however, comes at the price of high computational demands, which are at odds with many application domains, such as computer vision on low-power devices, autonomous driving, etc.

The second idea, invariably used in all modern architectures, is downsampling, which can be implemented via pooling layers [17] or simply by strided convolutions [16, 27]. Spatial shrinking of the representations naturally makes integration of information from different parts of the visual field easier. At the same time, excessive downsampling leads to the loss of spatial information.
This can be particularly problematic for applications such as semantic segmentation, where each downsampling has to be complemented by upsampling layers in order to achieve pixel-level predictions. Spatial information usually cannot be recovered very well by downsampling-upsampling (hourglass) architectures, and the addition of skip connections [19, 22] provides only a partial remedy to this problem [33].

As discussed above, dilated convolutions (also known as à trous convolutions) [33, 3] are used extensively in order to expand the effective filter sizes and to speed up the propagation of spatial information during inference in ConvNets. Another potential approach is to introduce non-local, non-convolutional layers that can be regarded as the incorporation of Conditional Random Field (CRF) inference steps into the network. These include layers that emulate mean-field message propagation [35, 4] as well as Gaussian CRF layers [2].

Our work introduces yet another mechanism for spatial propagation/integration of information in ConvNets, one that can be combined with the approaches outlined above. Our comparison with dilated convolutions suggests that the new approach is competitive and may be valuable for the design of efficient ConvNets.

Box Filters in Computer Vision. Our approach is based on the box filtering idea that has long been mainstream in computer vision. Through the 2000s, a large number of architectures that use box filtering to integrate spatial context information were proposed. The landmark work that started the trend was the Viola-Jones face detection system [31]. Later, this idea was extended to pedestrian detectors [32]. Two-layer architectures that applied box filtering on top of other transforms, such as texton filters or edge detectors, became popular by the end of that decade [24, 9]. Box-filtered features remain a popular choice for building decision-tree based architectures [23, 8].
All these (and hundreds of other works) capitalized on the ability of box filtering to be performed very efficiently through the use of the integral image trick [18].

Given the success of integral-image-based box filtering in the "pre-deep learning era", it is perhaps surprising that very few attempts have been made to insert such filtering into ConvNets. Various methods that perform sum/average pooling over large spatial boxes have been proposed for deep object detection [11], semantic segmentation [34], and image retrieval [1]. All those systems, however, apply box filtering only at one point of the pipeline (typically towards the very end, after the convolutional part), do so in a non-convolutional manner, and do not rely on the integral image trick, since the number of boxes over which the summation is performed is usually limited. Integral-image-based filtering has been applied to pool deep features over sliding windows in [10]. Rather differently from our method, [10] uses fixed predefined sizes of the boxes, and uses average box filtering only as a penultimate layer in their network, which predicts the objectness score. In contrast, our approach learns the coordinates of the boxes in an end-to-end manner, and provides a generic layer that can be inserted into a ConvNet architecture multiple times. Our experiments verify that learning box coordinates is important for achieving good performance.

3 Method

This section describes the Box Convolution layer and discusses its usage in ConvNets in detail. Note that while the idea behind the new layer is rather simple (implement box averaging in a convolutional manner; use integral images to speed up the convolutions), an important part of our approach is to make the coordinates of the boxes learnable.
This requires us to consider continuous-valued box coordinates, unlike the approaches in classic computer vision, which invariably consider integer-valued box coordinates.

3.1 Box Convolution Layer

We start by defining the box averaging kernel with parameters θ = (x_min, x_max, y_min, y_max) as the following function over the 2D plane R²:

$$K_\theta(x, y) = \frac{\mathbb{I}(x_{\min} \le x \le x_{\max})\,\mathbb{I}(y_{\min} \le y \le y_{\max})}{(x_{\max} - x_{\min})(y_{\max} - y_{\min})} \qquad (1)$$

Here, x_min < x_max, y_min < y_max are the dimensions of the box averaging kernel (jointly denoted as θ), and 𝕀 is the indicator function. The kernel naturally integrates to one, and convolving a function with such a kernel corresponds to a low-pass averaging transform.

Forward Pass. The box convolution layer takes as input a set of N convolutional maps (2D tensors) and applies M different box kernels to each of the N incoming channels, thus resulting in NM output convolutional maps (2D tensors). The layer therefore has 4NM learnable parameters $\{\theta^m_n\}_{n=1,m=1}^{N,M}$.

The input maps are defined over a discrete lattice $(\hat{I}_{i,j})_{i=1,j=1}^{w,h}$, where h and w are the image dimensions. In order to apply box averaging corresponding to continuous θ in (1), we extend each of the input maps to the continuous plane. We use a piecewise-constant approximation and zero padding for this extension as follows:

$$I(x, y) = \begin{cases} \hat{I}_{[x],[y]}, & 1 \le [x] \le h \text{ and } 1 \le [y] \le w; \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where [·] denotes rounding to the nearest integer. Here Î denotes one of the input channels and I denotes its extension onto the plane.

Let Ô be one of the output channels corresponding to the input channel Î and let θ = (x_min, x_max, y_min, y_max) be the corresponding box coordinates. To find Ô we naturally apply the convolution (correlation) with the kernel (1), and then convert the output back to the discrete representation by sampling the result of the convolution at the lattice vertices. Overall, this corresponds to the following transformation:

$$\hat{O}_{x,y} = O(x, y) = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} I(x+u, y+v)\, K_\theta(u, v)\, du\, dv \qquad (3)$$

$$= \frac{1}{(x_{\max} - x_{\min})(y_{\max} - y_{\min})} \int_{x+x_{\min}}^{x+x_{\max}}\!\!\int_{y+y_{\min}}^{y+y_{\max}} I(u, v)\, du\, dv, \qquad (4)$$

where O denotes the continuous result of the convolution.

Note that while our construction here uses zero-padding in (2), more sophisticated padding schemes are possible. Our experiments with such schemes, however, did not result in better performance.

Backpropagation to the inputs. Since the transformation described above is a convolution (correlation), the gradient of some loss L (e.g. a semantic segmentation loss) w.r.t. its input can be obtained by computing the correlation of the loss gradients w.r.t. its output with the flipped kernels K_θ(−x, −y), where the contributions from the N output channels corresponding to the same input channel are accumulated additively:

$$\frac{\partial L}{\partial \hat{I}_{x,y}} = \sum_{n=1}^{N} \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} G_n(x+u, y+v)\, K_{\theta_n}(-u, -v)\, du\, dv, \qquad (5)$$

where θ_1, …, θ_N are the layer's parameters used to produce the output channels Ô_1, …, Ô_N respectively, and G_n(x, y) is the continuous-domain extension of ∂L/∂Ô_n as in (2).

Backpropagation to the parameters θ.
The expression for the partial derivative ∂L/∂x_max is derived in the Appendix and evaluates to:

$$\frac{\partial L}{\partial x_{\max}} = -\frac{1}{x_{\max} - x_{\min}} \sum_{x=1}^{h} \sum_{y=1}^{w} \frac{\partial L}{\partial \hat{O}_{x,y}}\, \hat{O}_{x,y} \;+\; \frac{1}{(x_{\max} - x_{\min})(y_{\max} - y_{\min})} \sum_{x=1}^{h} \sum_{y=1}^{w} \frac{\partial L}{\partial \hat{O}_{x,y}} \int_{y+y_{\min}}^{y+y_{\max}} I(x + x_{\max}, v)\, dv. \qquad (6)$$

The partial derivatives for x_min, y_min, y_max have analogous expressions.

Initialization and regularization. Unlike many common ConvNet layers, ours does not allow arbitrary real parameters θ. Thus, during optimization we ensure positive widths and heights. During learning, we ensure that the width and the height are at least ε = 1 pixel: x_max − x_min > ε, y_max − y_min > ε. These constraints are enforced in a projected gradient descent fashion. E.g., when the first of these constraints gets violated, we change x_min and x_max by moving them away from each other to the distance ε, while preserving the midpoint. We also clip the coordinates when they go outside the [−w; +w] range (for x_min and x_max) or the [−h; +h] range (for y_min and y_max). Additionally, we impose a standard L2-regularization (λ/2)‖θ‖² on the parameters of all box layers, thus shrinking the box dimensions towards zero at each step.
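The projection step just described can be sketched in a few lines. This is a minimal NumPy sketch under our own conventions (the function name and the (K, 4) parameter layout are assumptions for illustration, not the authors' code): boxes thinner than ε are widened symmetrically around their midpoint, and all coordinates are clipped to the allowed ranges.

```python
import numpy as np

def project_boxes(theta, w, h, eps=1.0):
    """Project box parameters back onto the feasible set after a gradient step.

    theta: array of shape (K, 4) holding (x_min, x_max, y_min, y_max) per box.
    Enforces x_max - x_min >= eps (and likewise for y) by pushing the two
    edges apart around their midpoint, then clips x-coordinates to [-w, w]
    and y-coordinates to [-h, h], as described in the text.
    """
    theta = theta.copy()
    for lo, hi, lim in ((0, 1, w), (2, 3, h)):
        too_thin = theta[:, hi] - theta[:, lo] < eps
        mid = 0.5 * (theta[:, lo] + theta[:, hi])
        # Widen degenerate boxes symmetrically, preserving the midpoint.
        theta[too_thin, lo] = mid[too_thin] - 0.5 * eps
        theta[too_thin, hi] = mid[too_thin] + 0.5 * eps
        # Clip coordinates that drifted outside the image-sized range.
        theta[:, [lo, hi]] = np.clip(theta[:, [lo, hi]], -lim, lim)
    return theta
```

In an actual training loop this projection would run after each optimizer step, with the L2 weight decay on θ handled by the optimizer itself.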
Among other things, such regularization prevents both instabilities (often associated with some filters growing very large) as well as the emergence of degenerate solutions, where boxes drift completely outside of the image boundaries for all locations.

To initialize θ, we aim to diversify the initial filters, so we initialize them randomly and independently. We first sample the center point of a box uniformly from the rectangle B = [−w/2; w/2] × [−h/2; h/2], and then uniformly sample the width and the height so that [x_min; x_max] ⊂ [−w/2; w/2] and [y_min; y_max] ⊂ [−h/2; h/2]. Such a choice of initial parameters ensures enough potentially non-zero output pixels in the resulting layer (which leads to a strong enough gradient during learning).

Fast computation via integral images. The forward-pass computation (4) as well as the backward-pass computations (5) and (6) involve integration over 1D axis-aligned intervals and 2D axis-aligned boxes. To enable fast computation, we split each interval of integration into the integer part and the two fractional parts at the ends. As a result, all integrals can be approximated using box sums and line sums. For example, if x_max − x_min > 1, then the integral in (4) can be estimated as a box sum $\sum_{i=[x+x_{\min}]}^{[x+x_{\max}]} \sum_{j=[y+y_{\min}]}^{[y+y_{\max}]} \hat{I}_{i,j}$, plus a weighted sum of four line sums, e.g. $\big([x+x_{\max}] + \tfrac{1}{2} - x - x_{\max}\big) \sum_{j=[y+y_{\min}]}^{[y+y_{\max}]} \hat{I}_{[x+x_{\max}],j}$, plus four scalars corresponding to the corners of the domain, e.g. $\big([x+x_{\max}] + \tfrac{1}{2} - x - x_{\max}\big)\big([y+y_{\max}] + \tfrac{1}{2} - y - y_{\max}\big)\, \hat{I}_{[x+x_{\max}],[y+y_{\max}]}$. Box sums and line sums can then be handled efficiently using the integral image $I^\Sigma(x, y) = \sum_{i \le x} \sum_{j \le y} \hat{I}_{i,j}$. The backprop step is handled analogously, as $I^\Sigma$ is reused to compute (6), and the integral image of ∂L/∂Ô_{x,y} is computed to evaluate the integrals in (5).

Integral image computation on the GPU is performed by first running a parallel cumulative sum over the columns, then transposing the result and accumulating over the new columns (the former rows) again, and finally transposing back.

Figure 1: The block architecture that is used to embed box convolution into our architectures. For the sake of simplicity, the specific narrowing factor of 4 is used. The block combines a 1 × 1 convolution that shrinks the number of channels from N to N/4, and then a box convolution that increases the number of channels from N/4 back to N.

Embedding box convolutions into an architecture. The derived box convolution layer acts on each of the input channels independently. It is therefore natural to interleave box convolutions with cross-channel 1 × 1 convolutions, making the resulting block in many ways similar to the depthwise separable convolution block [5]. We further expand the block by inserting the standard components, namely ReLU non-linearities, batch normalization [14], a dropout layer [28] and a residual connection [12]. The resulting block architecture is shown in Figure 1. The block transforms an input stack of N channels into a same-size stack of N channels. Notably, the majority of the learnable parameters (and the majority of the floating-point operations) are in the 1 × 1 convolution, which has O(N²) complexity per spatial position.
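The integer-coordinate core of the fast computation described above can be sketched as follows. This is a minimal NumPy illustration (function names are ours): a summed-area table is built with two cumulative sums, after which any box sum costs four lookups. The fractional edge corrections (line sums and corner terms) from the paragraph above are deliberately omitted here.

```python
import numpy as np

def integral_image(img):
    """Summed-area table S with one-pixel zero padding:
    S[x, y] = sum of img[:x, :y], so S[0, :] = S[:, 0] = 0."""
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    # Two cumulative sums; on the GPU this is done as cumsum over columns,
    # transpose, cumsum again, transpose back (as described in the text).
    S[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return S

def box_sum(S, x0, x1, y0, y1):
    """Sum of img[x0:x1, y0:y1] in O(1) from four corner lookups."""
    return S[x1, y1] - S[x0, y1] - S[x1, y0] + S[x0, y0]
```

Since `S` is computed once per channel, evaluating the layer's boxes at every spatial location costs a constant number of lookups per output pixel regardless of box size, which is the source of the layer's efficiency.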
All remaining layers, including the box convolution layer, which acts within each channel independently, have O(N) complexity per spatial location.

4 Experiments

We evaluate and analyze the performance of the new layer on the task of semantic segmentation, which is among the vision tasks most sensitive to the ability to propagate contextual evidence and to preserve spatial information. Also, convolutional architectures for semantic segmentation have been studied extensively over the last several years, resulting in strong baseline architectures.

Base architectures and datasets. To evaluate the new layer and the embedding block, we consider two base architectures for semantic segmentation: ENet [20] and ERFNet [21]. Both architectures have been designed to be accurate and at the same time very efficient. They both consist of similar residual blocks and feature dilated convolutions. In our evaluation, we replace several such blocks with the new block (Figure 1).

Both ENet and ERFNet have been fine-tuned to perform well on the Cityscapes dataset for autonomous driving [7]. The dataset ("fine" version) consists of 2975 training, 500 validation and 1525 test images of urban environments, manually annotated with 19 classes. This dataset represents one of the main benchmarks for semantic segmentation, with a very large number of evaluated architectures. The ENet architecture has also been tuned for the SUN RGB-D dataset [26], which is a popular benchmark for indoor semantic segmentation and consists of 5285 train and 5050 test images, providing annotation for 37 object and stuff classes.
Following the original paper [20], we train all architectures on RGB data only, ignoring depth.

[Figure 1 diagram: Input (N channels) → 1×1 Conv (N→N/4) → Batch Norm → Box Conv (N/4→N) → ReLU → Batch Norm → Dropout → ReLU → Add Shortcut → Output (N channels)]

Cityscapes, ENet family:

                  ------------- Validation set -------------    ------- Test set -------
                  ENet  BoxENet  BoxENet†  ENet−  BoxOnlyENet   ENet  BoxENet  BoxOnlyENet
  IoU-class, %    59.4  64.6     61.0      58.3   60.3          54.1  64.7     61.8
  IoU-categ., %   81.8  83.2     81.8      80.4   80.9          80.9  83.8     82.1

Cityscapes, ERFNet family:

                  --------- Validation set ----------    ----- Test set -----
                  ERFNet  BoxERFNet  BoxERFNet†  ERFNet−  ERFNet  BoxERFNet
  IoU-class, %    68.8    69.0       63.6        59.8     68.0    68.1
  IoU-categ., %   85.3    85.4       84.1        84.1     86.5    85.6

Table 1: Results for ENet-based models (top) and ERFNet-based models (bottom) on the Cityscapes dataset. For the ENet configurations, BoxENet considerably outperforms ENet as well as the ablations. For ERFNet, the version with boxes performs on par, while requiring fewer resources (Table 3). See text for more discussion.

                    ENet   BoxENet  BoxENet†  ENet−  ERFNet  BoxERFNet
  mean IoU          22.9%  24.5%    13.2%     21.9%  25.3%   28.7%
  Class accuracy    34.0%  37.2%    19.0%     32.2%  36.1%   41.9%
  Pixel accuracy    66.5%  67.1%    59.7%     64.2%  68.6%   69.0%

Table 2: Performance on the test set of the SUN RGB-D dataset (all architectures disregarded the depth channel). The networks with box convolutions perform better than the base architectures and than the ablations. In the case of the ERFNet family, BoxERFNet also requires fewer resources than ERFNet (Table 3). See text for more discussion.

New architectures. We design two new architectures, namely BoxENet based on ENet and BoxERFNet based on ERFNet. Both base architectures contain downsampling (based on strided convolution) and upsampling (based on upconvolution) layer groups. Between them, a sequence of residual blocks is inserted.
For instance, the ENet authors rely on bottleneck residual blocks, sometimes employing dilation in the block's 3 × 3 convolution, or replacing it by two 1D convolutions (also possibly dilated). ENet has four of these blocks after the second downsampling group, 16 after the third group, two after the first upsampling and one after the second upsampling.

When designing BoxENet, we replace every second block in all these sequences with our proposed block, additionally replacing the very last block as well. In fact, in this way we replace all blocks that contain dilated convolutions, so BoxENet does not have any dilated convolutions at all. The comparison between ENet and BoxENet thus effectively pits dilated convolutions against box convolutions. We have further evaluated the configuration where all resolution-preserving blocks are replaced by our proposed block (BoxOnlyENet). We use the same bottleneck narrowing factor of 4 as in the original ENet blocks, except for the very last block, where we squeeze N channels to N/2. The dropout rate is set to 0.25 where the feature map resolution is lowest (1/8th of the input), and to 0.15 elsewhere.

ERFNet has a similar design to ENet, although its blocks operate at the full number of channels. Here, we implemented the above pattern with the following changes: (i) in the first resolution-preserving block sequence, only one original block is kept, (ii) from each of the last two sequences, one of the two blocks is simply removed without replacing it by our block. All our blocks have a narrowing factor of 4 (keeping this from the ENet experiments). This time we always use the exact same (possibly zero) dropout rate as in the corresponding original replaced block. In addition, we remove dilation from all the remaining blocks.

Ablations.
The key aspect of the new box convolution layer is the learnability of the box coordinates. To assess the importance of this aspect, we have evaluated the ablated architectures BoxENet† and BoxERFNet†, which are identical to BoxENet and BoxERFNet, but have the coordinates of the boxes in the box convolution layers frozen at initialization. Another pair of ablations, ENet− and ERFNet−, are modifications that have the corresponding residual blocks removed rather than replaced with the new block (these ablations were added to make sure that the increase in accuracy was not coming simply from the reduction in learnable parameters).

Performance. We have used standard learning practices (ADAM optimizer [15], step learning rate policy with the same learning rate as in the original papers [20, 21]) and standard error measures.

                          ENet   BoxENet  ENet−  BoxOnlyENet  ERFNet  BoxERFNet  ERFNet−
  MultAdds, billions      4.601  3.468    2.950  2.290        30.876  16.042     15.827
  CPU time, ms            558    478      338    414          1682    822        772
  GPU time, ms            32.8   33.1     21.2   33.6         78.4    44.0       39.7
  # of params, millions   0.356  0.243    0.201  0.124        2.059   1.040      1.020

Table 3: Resource costs for the architectures. Timings are averaged over 40 runs. See text for discussion.

Figure 2: Change of the box coordinates of the first box convolution layer in BoxENet as the training on Cityscapes progresses. The boxes with the biggest importance (at the end of training) are shown.

Once the box-based architectures are trained, we round the box coordinates to the nearest integers, fix them and fine-tune the network, so that at test time we do not have to deal with fractional parts of the integration intervals. Such rounding post-processing resulted in slightly lower runtimes (5.7% and 1.3% speedups for BoxENet and BoxERFNet respectively) with essentially the same performance.
Our approach was implemented using the Torch7 library [6].

The comparison on the validation set of the Cityscapes dataset is given in Table 1. We also report the performance of the non-ablated variants on the test set. Table 2 further reports the comparison on the test set of the SUN RGB-D dataset (note that ENet accuracy on SUN RGB-D is higher than reported in the original paper due to the higher input image resolution). The new architectures (BoxENet/BoxERFNet) outperform their counterparts (ENet/ERFNet) considerably for the ENet family on both datasets and for the ERFNet family on the SUN RGB-D dataset. On the Cityscapes dataset, BoxERFNet achieves accuracy similar to ERFNet, however it has a strong advantage in terms of computational resources (see below). BoxOnlyENet performs in-between ENet and BoxENet, suggesting that the optimal approach might be to combine standard 3×3 convolutions with box convolutions (as is done in BoxENet). Finally, fixing the bounding box coordinates at random initializations (BoxENet† and BoxERFNet†) leads to a significant degradation in accuracy, suggesting that the learnability of the box coordinates is crucial.

We further compare the number of operations, the GPU and CPU inference times on a laptop (an NVIDIA GTX 1050 GPU with cuDNN 7.2, a single core of an Intel i7-7700HQ CPU), and the number of learnable parameters in Table 3. The new architectures are more efficient in terms of the number of operations, and also in terms of GPU timings in the ERFNet case. The GPU computation time is marginally bigger for BoxENet compared to ENet despite BoxENet incurring fewer operations, as the level of optimization of our code does not quite reach that of the cuDNN kernels. On the CPU, the new architectures are considerably faster (by a factor of two in the ERFNet case). Finally, the box-based architectures have considerably fewer learnable parameters compared to the base architectures.

Box statistics.
To analyze the learning of box coordinates, we introduce a measure of box importance. For each box filter, its importance is defined as the average absolute weight of the corresponding channel in the subsequent 1 × 1 convolution multiplied by the maximum absolute weight corresponding to the input channel in the preceding convolution.

The evolution of the boxes during learning is shown in Figure 2. We observe the following trends. Firstly, under the imposed regularization, a certain subset of boxes shrinks to the minimal width and height. Detecting such boxes and eliminating them from the network is possible at this point. While this should lead to improved computational efficiency, we do not pursue this in our experiments. Another obvious trend is the emergence of boxes that are symmetric w.r.t. the vertical axis. We observe this phenomenon to be persistent across layers, architectures and datasets. The effect persists even when horizontal flip augmentations are switched off during training. All of this suggests that a three-DOF parameterization θ = {y_min, y_max, width} can be used in place of the four-DOF parameterization.

[Figure 2 panels: Initial → 4.8k iterations → 16k iterations → 28k iterations → 56k iterations → Converged]

Figure 3: The vertical axis shows the areas of box filters learned on Cityscapes by the BoxENet network (note the log scale). Colors depict different layers. The learned network contains very large (>10000 pixels) boxes and many boxes spanning many hundreds of pixels. Using filters of this size in a conventional ConvNet would be impractical from the computational and statistical viewpoints.

Figure 4: The curves visualizing the effective receptive fields of the output units of different architectures (left) and the shallow sub-networks of the BoxENet and ENet architectures, specifically the first 7 (i.e.
up to the first box convolution block) blocks of BoxENet and their ENet counterpart with the original 7th block (right). The models have been trained on the Cityscapes dataset. The effective receptive field can be thought of as the radius (horizontal position) at which the curve saturates to 1 (see text for the details of the curve computation). Architectures with box convolutions have much bigger effective receptive fields.

The final state of BoxENet trained on the Cityscapes dataset is visualized in Figure 3. The scatterplot shows the sizes (areas) of boxes in different layers plotted against the importances determined by the activation strength of the input map and the connection strength to the subsequent layers. The network has converged to a state with a lot of large and very large box filters, and most of the very large filters are important for the subsequent processing. We conclude that the resulting network relies heavily on box averaging over large and very large extents. A standard ConvNet with similarly-sized filters of a general form would not be practical, as it would be too slow and its number of parameters would be too high.

Receptive fields analysis. We have analyzed the impact of the new layers on the effective size of the receptive fields. For this purpose we have used the following measure. Given a spatial position (p, q) inside the visual field and a layer k, we consider the gradient response map

$$\hat{M}(p, q, k, \hat{I}^0) = \left\{ \frac{1}{N_k} \sum_{i=1}^{N_k} \left\| \partial y_i(p, q) \,/\, \partial \hat{I}^0_{x,y} \right\| \right\}_{x,y},$$

where y_i(p, q) is the unit at level k of the network corresponding to the i-th map at position (p, q), Î⁰ is the input image, N_k is the number of maps at level k, and (x, y) is a position in the input image.

The map M̂(p, q, k, Î⁰) estimates the influence of various positions in the input image on the activation of the unit in the k-th channel at position (p, q).
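To summarize how concentrated such a gradient response map is around (p, q), one can integrate it over growing square windows and normalize by its total mass. A minimal NumPy sketch of this summary (the function and argument names are our own, not the paper's code; the map itself is assumed to be precomputed, e.g. by backpropagation):

```python
import numpy as np

def window_energy_fraction(grad_map, p, q, r):
    """Fraction of the map's total mass inside the square window of
    half-size r centered at (p, q); grows towards 1 as r increases."""
    h, w = grad_map.shape
    window = grad_map[max(p - r, 0):min(p + r + 1, h),
                      max(q - r, 0):min(q + r + 1, w)]
    return window.sum() / grad_map.sum()
```

Sweeping r from 0 up to the image size and recording this fraction yields exactly the kind of saturation curve discussed next: the slower the curve approaches 1, the more spread out the map, i.e. the larger the effective receptive field.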
The bigger the effective receptive fields of the units in layer k, the more spread out the map \hat{M}(p, q, k, \hat{I}_0) will be around the position (p, q). We measure this spread by taking a random image from the Cityscapes dataset, considering a random location (p, q) in the visual field, and computing the gradient response map \hat{M}(p, q, k, \hat{I}_0) for a certain level k of the network. For each computed map, we then consider a family of square windows of varying radius r centered at (p, q), integrate the map value over each window, and divide the integral by the total integral of the gradient response map, obtaining a value E(r) between 0 and 1 (we refer to this value as the cumulative gradient energy). One can think of the value r at which E(r) comes close to 1 as the radius of the effective receptive field (since the corresponding window contains nearly all gradients).

[Figure 3 axes: box area (pixels, log scale) vs. box filter importance; a 3×3 convolution is marked for reference. Figure 4 axes: pixel neighbourhood size vs. cumulative gradient energy; curves: ENet, ERFNet, BoxENet, BoxERFNet.]

We then consider the curves E(r), averaged over 30 images and 20 random positions (p, q). The curves for the final layer (prior to the softmax) and an early layer (after the first seven convolutions) are shown in Figure 4. For the networks with box convolutions, the cumulative curves saturate to 1 at much higher radii r than for the networks relying on dilated convolutions (for some locations and some images, the effective receptive field spans the whole image).
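The cumulative gradient energy itself reduces to a few lines; the following sketch follows the definitions above, assuming the gradient response map M has already been computed (function and variable names are ours):

```python
import numpy as np

def cumulative_gradient_energy(M, p, q, radii):
    """E(r) for a nonnegative gradient response map M: the fraction of the
    total mass of M that falls inside the square window of radius r
    centered at (p, q), with the window clipped to the map boundaries."""
    h, w = M.shape
    total = M.sum()
    E = []
    for r in radii:
        window = M[max(0, p - r):min(h, p + r + 1),
                   max(0, q - r):min(w, q + r + 1)]
        E.append(window.sum() / total)
    return np.array(E)
```

Because the windows are nested, E(r) is non-decreasing and reaches 1 once the window covers all of the gradient mass; the effective receptive-field radius is read off as the smallest r at which E(r) comes close to 1.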
Overall, we conclude that networks with box convolutions have much bigger effective receptive fields, both for the units in early layers and for the output units.

5 Summary

We have introduced a new convolutional layer that computes box filter responses at every location, while optimizing the parameters (coordinates) of the boxes within the end-to-end learning process. The new layer therefore combines the ability to aggregate information over large areas with a low number of learnable parameters and the computational efficiency achieved via the integral image trick. We have shown that the learning process indeed leads to large boxes within the new layer, and that incorporating the new layer very considerably increases the receptive fields of the units in the middle of semantic segmentation networks, explaining the improved segmentation accuracy. What is more, this increase in accuracy comes alongside a reduction in the number of operations. The code of the new layer, as well as the implementations of the BoxENet and BoxERFNet architectures, is available at the project website (https://github.com/shrubb/box-convolutions).

Appendix

Here, we derive the backpropagation equation for $\partial L / \partial x_{\max}$. We start with the chain rule, which suggests:
$$\frac{\partial L}{\partial x_{\max}} = \sum_{x=1}^{h} \sum_{y=1}^{w} \left[ \frac{\partial L}{\partial \hat{O}_{x,y}} \cdot \frac{\partial \hat{O}_{x,y}}{\partial x_{\max}} \right]. \qquad (7)$$

To compute the derivative of \hat{O}_{x,y} with respect to x_{\max}, we use the product rule, treating (4) as a product of a ratio coefficient and a double integral.
The derivative of the former evaluates to:
$$\frac{\partial}{\partial x_{\max}} \left[ \frac{1}{(x_{\max} - x_{\min})(y_{\max} - y_{\min})} \right] = -\frac{1}{(x_{\max} - x_{\min})^2 \, (y_{\max} - y_{\min})}. \qquad (8)$$

The double integral in (4) has x_{\max} as one of its limits, and its derivative therefore evaluates to:
$$\frac{\partial}{\partial x_{\max}} \left[ \int_{x + x_{\min}}^{x + x_{\max}} \int_{y + y_{\min}}^{y + y_{\max}} I(u, v) \, dv \, du \right] = \int_{y + y_{\min}}^{y + y_{\max}} I(x + x_{\max}, v) \, dv. \qquad (9)$$

Plugging both (8) and (9) into the product rule for expression (4), we get:
$$\frac{\partial \hat{O}_{x,y}}{\partial x_{\max}} = \frac{1}{x_{\max} - x_{\min}} \left( -\hat{O}_{x,y} + \frac{1}{y_{\max} - y_{\min}} \int_{y + y_{\min}}^{y + y_{\max}} I(x + x_{\max}, v) \, dv \right), \qquad (10)$$
which allows us to expand (7) as (6).

Acknowledgement. Most of the work was done when both authors were full-time with the Skolkovo Institute of Science and Technology. The work was supported by the Ministry of Science of the Russian Federation, grant 14.756.31.0001.