{"title": "FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 754, "page_last": 764, "abstract": "The basic principles in designing convolutional neural network (CNN) structures for predicting objects on different levels, e.g., image-level, region-level, and pixel-level, are diverging. Generally, network structures designed specifically for image classification are directly used as default backbone structure for other tasks including detection and segmentation, but there is seldom backbone structure designed under the consideration of unifying the advantages of networks designed for pixel-level or region-level predicting tasks, which may require very deep features with high resolution. Towards this goal, we design a fish-like network, called FishNet. In FishNet, the information of all resolutions is preserved and refined for the final task. Besides, we observe that existing works still cannot directly propagate the gradient information from deep layers to shallow layers. Our design can better handle this problem. Extensive experiments have been conducted to demonstrate the remarkable performance of the FishNet. In particular, on ImageNet-1k, the accuracy of FishNet is able to surpass the performance of DenseNet and ResNet with fewer parameters. FishNet was applied as one of the modules in the winning entry of the COCO Detection 2018 challenge. 
The code is available at https://github.com/kevin-ssy/FishNet.", "full_text": "FishNet: A Versatile Backbone for Image, Region,\n\nand Pixel Level Prediction\n\nShuyang Sun1, Jiangmiao Pang3, Jianping Shi2, Shuai Yi2, Wanli Ouyang1\n\n1The University of Sydney 2SenseTime Research 3Zhejiang University\n\nshuyang.sun@sydney.edu.au\n\nAbstract\n\nThe basic principles in designing convolutional neural network (CNN) structures\nfor predicting objects on different levels, e.g., image-level, region-level, and pixel-\nlevel, are diverging. Generally, network structures designed speci\ufb01cally for image\nclassi\ufb01cation are directly used as default backbone structure for other tasks includ-\ning detection and segmentation, but there is seldom backbone structure designed\nunder the consideration of unifying the advantages of networks designed for pixel-\nlevel or region-level predicting tasks, which may require very deep features with\nhigh resolution. Towards this goal, we design a \ufb01sh-like network, called FishNet.\nIn FishNet, the information of all resolutions is preserved and re\ufb01ned for the \ufb01nal\ntask. Besides, we observe that existing works still cannot directly propagate the\ngradient information from deep layers to shallow layers. Our design can better\nhandle this problem. Extensive experiments have been conducted to demonstrate\nthe remarkable performance of the FishNet. In particular, on ImageNet-1k, the\naccuracy of FishNet is able to surpass the performance of DenseNet and ResNet\nwith fewer parameters. FishNet was applied as one of the modules in the win-\nning entry of the COCO Detection 2018 challenge. The code is available at\nhttps://github.com/kevin-ssy/FishNet.\n\n1\n\nIntroduction\n\nConvolutional Neural Network (CNN) has been found to be effective for learning better feature\nrepresentations in the \ufb01eld of computer vision [17, 26, 28, 9, 37, 27, 4]. 
Thereby, the design of CNN\nbecomes a fundamental task that can help to boost the performance of many other related tasks. As\nthe CNN becomes increasingly deep, recent works endeavor to refine or reuse the features from\nprevious layers through identity mappings [8] or concatenation [13].\nThe CNNs designed for image-level, region-level, and pixel-level tasks begin to diverge in network\nstructure. Networks for image classification use consecutive down-sampling to obtain deep features\nof low resolution. However, the features with low resolution are not suitable for pixel-level or even\nregion-level tasks. Direct use of high-resolution shallow features for region and pixel-level tasks,\nhowever, does not work well. In order to obtain deeper features with high resolution, the well-known\nnetwork structures for pixel-level tasks use U-Net or hourglass-like networks [22, 24, 30]. Recent\nworks on region-level tasks like object detection also use networks with an up-sampling mechanism\n[21, 19] so that small objects can be described by the features with relatively high resolution.\nDriven by the success of using high-resolution features for region-level and pixel-level tasks, this\npaper proposes a fish-like network, namely FishNet, which enables the features of high resolution to\ncontain high-level semantic information. In this way, features pre-trained from image classification\nare more friendly for region and pixel level tasks.\nWe carefully design a mechanism that has the following three advantages.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: The up/down-sampling block for ResNet (left), and FishNet (right). 
The 1 \u00d7 1 convolution\nlayer in yellow indicates the Isolated convolution (I-conv, see Section 2), which prevents direct BP\nand degrades the gradient from the output to shallow layers.\n\nFirst, it is the first backbone network that unifies the advantages of networks designed for pixel-level,\nregion-level, and image-level tasks. Compared to the networks designed purely for the image\nclassification task, our network as a backbone is more effective for pixel-level and region-level tasks.\nSecond, it enables the gradient from the very deep layer to be directly propagated to shallow\nlayers, called direct BP in this paper. Recent works show that there are two designs that can\nenable direct BP, identity mapping with residual block [8] and concatenation [13]. However, the\nuntold fact is that existing network designs, e.g. [9, 8, 13, 28, 34, 32], still do not enable direct BP.\nThis problem is caused by the convolutional layer between features of different resolutions. As shown\nin Figure 1, ResNet [9] utilizes a convolutional layer with stride on the skip connection to deal\nwith the inconsistency between the numbers of input and output channels, which makes the identity\nmapping inapplicable. Convolution without identity mapping or concatenation degrades the gradient\nfrom the output to shallow layers. Our design better solves this problem by concatenating features of\nvery different depths to the final output. We also carefully design the components in the network to\nensure the direct BP. With our design, the semantic meaning of features is also preserved throughout\nthe whole network.\nThird, features of very different depths are preserved and used for refining each other. Features\nwith different depths have different levels of abstraction of the image. All of them should be kept to\nimprove the diversity of features. Because of their complementarity, they can be used for refining\neach other. 
Therefore, we design a feature preserving-and-re\ufb01ning mechanism to achieve this goal.\nA possibly counter-intuitive effect of our design is that it performs better than traditional convolutional\nnetworks in the trade-off between the number of parameters and accuracy for image classi\ufb01cation.\nThe reasons are as follows: 1) the features preserved and re\ufb01ned are complementary to each other\nand more useful than designing networks with more width or depth; and 2) it facilitates the direct BP.\nExperimental results show that our compact model FishNet-150, of which the number of parameters\nis close to ResNet-50, is able to surpass the accuracy of ResNet-101 and DenseNet-161(k=48) on\nImageNet-1k. For region and pixel level tasks like object detection and instance segmentation, our\nmodel as a backbone for Mask R-CNN [10] improves the absolute AP by 2.8% and 2.3% respectively\non MS COCO compared to the baseline ResNet-50.\n\n1.1 Related works\n\nCNN architectures for image classi\ufb01cation. The design of deep CNN architecture is a fundamental\nbut challenging task in deep learning. Networks with better design extract better features, which can\nboost the performance of many other tasks. The remarkable improvement in the image recognition\nchallenge ILSVRC [25] achieved by AlexNet [17] symbolizes a new era of deep learning for computer\nvision. After that, a number of works, e.g. VGG [26], Inception [28], all propose to promote the\nnetwork capability by making the network deeper. However, the network at this time still cannot be\ntoo deep because of the problem of vanishing gradient. Recently, the problem of vanishing gradient\nis greatly relieved by introducing the skip connections into the network [9]. There is a series of\non-going works on this direction [29, 34, 32, 13, 2, 11, 31, 33]. 
However, among all these networks\ndesigned for image classification, the features of high resolution are extracted by the shallow layers\nwith small receptive field, which lack the high-level semantic meaning that can only be obtained on\ndeeper layers. Our work is the first to extract high-resolution deep features with high-level semantic\nmeaning and improve image classification accuracy at the same time.\n\nFigure 2: Overview of the FishNet. It has three parts. Tail uses existing works to obtain deep\nlow-resolution features from the input image. Body obtains high-resolution features of high-level\nsemantic information. Head preserves and refines the features from the three parts.\n\nDesign in combining features from different layers. Features from different resolutions or depths\ncould be combined using nested sparse networks [16], hyper-column [7], addition [18] and residual\nblocks [22, 21] (conv-deconv using residual blocks). Hyper-column networks directly concatenate\nfeatures from different layers for segmentation and localization in [7]. However, features from\ndeep layers and shallow layers were not used for refining each other. 
Addition [18] is a fusion of\nthe features from deep and shallow layers. However, addition only mixes the features of different\nabstraction levels, but cannot preserve or refine both of them. Concatenation followed by convolution\nis similar to addition [33]. When residual blocks [22, 21], also with addition, are used for combining\nfeatures, existing works have a pre-defined target to be refined. If the skip layer is for the deep\nfeatures, then the shallow features serve only for refining the deep features and are discarded\nafter the residual blocks in this case. In summary, addition and residual blocks in existing works do\nnot preserve features from both shallow and deep layers, while our design preserves and refines them.\nNetworks with up-sampling mechanism. As there are many other tasks in computer vision, e.g.\nobject detection, segmentation, that require large feature maps to keep the resolution, it is necessary\nto apply up-sampling methods to the network. Such a mechanism often includes the communication\nbetween the features with very different depths. The series of works including U-Net [24], FPN [21],\nstacked hourglass [22] etc., have all shown their capability in pixel-level tasks [22] and region-level\ntasks [21, 19]. However, none of them has been proven to be effective for the image classification task.\nMSDNet [12] tries to keep the feature maps with large resolution, which is the most similar work\nto our architecture. However, the architecture of MSDNet still uses convolution between features\nof different resolutions, which cannot preserve the representations. Besides, it does not provide an\nup-sampling pathway to enable features with large resolution and more semantic meaning. The aim\nof MSDNet introducing the multi-scale mechanism into its architecture is to do budget prediction.\nSuch a design, however, did not show improvement in accuracy for image classification. 
Our FishNet\nis the first in showing that the U-Net structure can be effective for image classification. Besides, our\nwork preserves and refines features from both shallow and deep layers for the final task, which is not\nachieved in existing networks with up-sampling or MSDNet.\nMessage passing among features/outputs. There are some approaches using message passing\namong features for segmentation [36], pose estimation [3] and object detection [35]. These designs\nare based on backbone networks, and the FishNet is a backbone network complementary to them.\n\n2 Identity Mappings in Deep Residual Networks and Isolated Convolution\n\nThe basic building block for ResNet is called the residual block. The residual blocks with identity\nmapping [8] can be formulated as\n\n$$x_{l+1} = x_l + \\mathcal{F}(x_l, W_l), \\tag{1}$$\n\nwhere $x_l$ denotes the input feature for the residual block at layer $l$, and $\\mathcal{F}(x_l, W_l)$ denotes the residual\nfunction with input $x_l$ and parameters $W_l$. We consider the stack of all residual blocks for the same\nresolution as a stage. Denote the feature at the $l$th layer of stage $s$ by $x_{l,s}$. We have:\n\n$$x_{L_s,s} = x_{0,s} + \\sum_{l=1}^{L_s} \\mathcal{F}(x_{l,s}, W_{l,s}), \\qquad \\frac{\\partial L}{\\partial x_{0,s}} = \\frac{\\partial L}{\\partial x_{L_s,s}} \\left(1 + \\frac{\\partial}{\\partial x_{0,s}} \\sum_{l=1}^{L_s} \\mathcal{F}(x_{l,s}, W_{l,s})\\right), \\tag{2}$$\n\nwhere $L_s$ denotes the number of stacked residual blocks at the stage $s$, $L$ is a loss function. The\nadditive term $\\frac{\\partial L}{\\partial x_{L_s,s}}$ in (2) ensures that the gradient of $x_{L_s,s}$ can be directly propagated to $x_{0,s}$. We\nconsider features with different resolutions as having different stages. 
In the original ResNet, the\nfeatures of different resolutions are different in number of channels. Therefore, a transition function\n$h(\\cdot)$ is needed to change the number of channels before down-sampling:\n\n$$x'_{0,s+1} = h(x_{L_s,s}) = \\sigma(\\lambda_s \\otimes x_{L_s,s} + b_{L_s,s}), \\tag{3}$$\n\nwhere $\\sigma(\\cdot)$ is the activation function. $\\lambda_s$ is the filter and $b_{L_s,s}$ is the bias at the transition layer of\nstage $s$ respectively. The symbol $\\otimes$ represents the convolution. Since the numbers of channels for\n$x_{L_s,s}$ and $x'_{0,s+1}$ are different, identity mapping is not applicable.\nGradient propagation problem from Isolated convolution (I-conv). Isolated convolution (I-conv)\nis the convolution in (3) without identity mapping or concatenation. As analyzed and validated\nby experiments in [8], it is desirable to have the gradients from a deep layer directly transmitted\nto shallow layers. Residual blocks with identity mapping [8] and dense block with concatenation\n[13] facilitate such direct gradient propagation. Gradients from the deep layer cannot be directly\ntransmitted to the shallow layers if there is an I-conv. The I-conv between features with different\nresolutions in ResNet [8] and the I-conv (called transition layer in [13]) between adjacent dense\nblocks, however, hinders the direct gradient propagation. Since ResNet and DenseNet still have\nI-convs, the gradients from the output cannot be directly propagated to shallow layers for them,\nsimilarly for the networks in [17, 26]. The invertible down-sampling in [15] avoids the problem of\nI-conv by using all features from the current stage for the next stage. The problem is that it will\nexponentially increase the number of parameters as the stage ID increases (188M in [15]).\nWe have identified the gradient propagation problem of I-conv in existing networks. Therefore, we\npropose a new architecture, namely FishNet, to solve this problem.\n\n3 The FishNet\nFigure 2 shows an overview of the FishNet. 
The whole \"fish\" is divided into three parts: tail, body,\nand head. The fish tail is an existing CNN, e.g. ResNet, with the resolution of features becoming\nsmaller as the CNN goes deeper. The fish body has several up-sampling and refining blocks for\nrefining features from the tail and the body. The fish head has several down-sampling and refining\nblocks for preserving and refining features from the tail, body and head. The refined features at the\nlast convolutional layer of the head are used for the final task.\nStage in this paper refers to a group of convolutional blocks fed by the features with the same\nresolution. Each part in the FishNet could be divided into several stages according to the resolution\nof the output features. With the resolution becoming smaller, the stage ID goes higher. For\nexample, the blocks with output resolution 56 \u00d7 56 and 28 \u00d7 28 are at stage 1 and 2 respectively in\nall the three parts of the FishNet. Therefore, in the fish tail and head, the stage ID is becoming higher\nwhile forwarding, while in the body part the ID is getting smaller.\nFigure 3 shows the interaction among tail, body, and head for features of two stages. The fish tail\nin Figure 3(a) could be regarded as a residual network. The features from the tail undergo several\nresidual blocks and are also transmitted to the body through the horizontal arrows. The body in\nFigure 3(a) preserves both the features from the tail and the features from the previous stage of the\nbody by concatenation. Then these concatenated features will be up-sampled and refined with details\nshown in Figure 3(b) and the details about the UR-block will be discussed in Section 3.1. 
The refined\nfeatures are then used for the head and the next stage of the body. The head preserves and refines\nall the features from the body and the previous stage of the head. The refined features are then used\nfor the next stage of the head. Details for message passing at the head are shown in Figure 3(c) and\ndiscussed in Section 3.1. The horizontal connections represent the transferring blocks between the\ntail, the body and the head. In Figure 3(a), we use the residual block as the transferring blocks.\n\nFigure 3: (Better seen in color and zoomed in.) (a) Interaction among the tail, body and head for\nfeatures of two stages, the two figures listed on the right exhibit the detailed structure for (b) the\nUp-sampling & Refinement block (UR-block), and (c) the Down-sampling & Refinement block\n(DR-block). In Figure (a), feature concatenation is used when vertical and horizontal arrows\nmeet. The notations $C_*$, $*H$, $*W$ denote the number of channels, height, and width respectively. $k$\nrepresents the channel-wise reduction rate described in Equation 8 and Section 3.1. Note that there is\nno Isolated convolution (I-conv) in the fish body and head. Therefore, the gradient from the loss can\nbe directly propagated to shallow layers in tail, body and head.\n\n3.1 Feature refinement\n\nIn the FishNet, there are two kinds of blocks for up/down sampling and feature refinement: the\nUp-sampling & Refinement block (UR-block) and Down-sampling & Refinement block (DR-block).\nThe UR-block. Denote the output features from the first layer at the stage $s$ by $x^t_s$ and $x^b_s$ for the tail\nand body respectively. $s \\in \\{1, 2, ..., \\min(N^t - 1, N^b - 1)\\}$, $N^t$ and $N^b$ represent the number of\nstages for the tail part and the body part. Denote feature concatenation as concat(\u00b7). 
The UR-block\ncan be represented as follows:\n\n$$x^b_{s-1} = UR(x^b_s, \\mathcal{T}(x^t_s)) = \\mathrm{up}(\\tilde{x}^{\\prime b}_s), \\tag{4}$$\n\nwhere $\\mathcal{T}$ denotes the residual block transferring the feature $x^t_{s-1}$ from the tail to the body, and $\\mathrm{up}(\\tilde{x}^{\\prime b}_s)$\nrepresents the feature refined from the previous stage in the fish body. The output $x^b_{s-1}$ for the next stage\nis refined from $x^t_s$ and $x^b_s$ as follows:\n\n$$x^b_{s-1} = \\mathrm{up}(\\tilde{x}^{\\prime b}_s), \\tag{5}$$\n$$\\tilde{x}^{\\prime b}_s = r(\\tilde{x}^b_s) + \\mathcal{M}(\\tilde{x}^b_s), \\tag{6}$$\n$$\\tilde{x}^b_s = \\mathrm{concat}(x^b_s, \\mathcal{T}(x^t_s)), \\tag{7}$$\n\nwhere up(\u00b7) denotes the up-sampling function. In summary, the UR-block concatenates features\nfrom body and tail in (7) and refines them in (6), then upsamples them in (5) to obtain the output $x^b_{s-1}$.\nThe $\\mathcal{M}$ in (6) denotes the function that extracts the message from features $\\tilde{x}^b_s$. We implemented $\\mathcal{M}$\nas convolutions. Similar to the residual function $\\mathcal{F}$ in (1), the $\\mathcal{M}$ in (6) is implemented by bottleneck\nResidual Unit [8] with 3 convolutional layers. The channel-wise reduction function $r$ in (6) can be\nformulated as follows:\n\n$$r(x) = \\hat{x} = [\\hat{x}(1), \\hat{x}(2), \\ldots, \\hat{x}(c_{out})], \\qquad \\hat{x}(n) = \\sum_{j=0}^{k} x(k \\cdot n + j), \\quad n \\in \\{0, 1, ..., c_{out}\\}, \\tag{8}$$\n\nwhere $x = \\{x(1), x(2), \\ldots, x(c_{in})\\}$ denotes $c_{in}$ channels of input feature maps and $\\hat{x}$ denotes $c_{out}$\nchannels of output feature maps for the function $r$, $c_{in}/c_{out} = k$. It is an element-wise summation\nof feature maps from the adjacent $k$ channels to 1 channel. 
We use this simple operation to reduce\nthe number of channels to $1/k$ of the original, which keeps the number of channels concatenated to the previous\nstage small, saving computation and parameter size.\nThe DR-block. The DR-block at the head is similar to the UR-block. There are only two differences\nin implementation between them. First, we use 2 \u00d7 2 max-pooling for down-sampling in the DR-block.\nSecond, in the DR-block, the channel reduction function in the UR-block is not used so that the\ngradient at the current stage can be directly transmitted to the parameters at the previous stage.\nFollowing the UR-block in (5)-(7), the DR block can be implemented as follows:\n\n$$x^h_{s+1} = \\mathrm{down}(\\tilde{x}^{\\prime h}_s), \\qquad \\tilde{x}^{\\prime h}_s = \\tilde{x}^h_s + \\mathcal{M}(\\tilde{x}^h_s), \\qquad \\tilde{x}^h_s = \\mathrm{concat}(x^h_s, \\mathcal{T}(x^b_s)), \\tag{9}$$\n\nwhere $x^h_{s+1}$ denotes the features at the head part for the stage $s + 1$. 
In this way, the features\nfrom every stage of the whole network are able to be directly connected to the final layer through\nconcatenation, skip connection, and max-pooling. Note that we do not apply the channel-wise\nsummation operation r(\u00b7) defined in (6) to obtain $\\tilde{x}^{\\prime h}_s$ from $\\tilde{x}^h_s$ for the DR-block in (9). Therefore, the\nlayers obtaining $\\tilde{x}^{\\prime h}_s$ from $\\tilde{x}^h_s$ in the DR-block could be actually regarded as a residual block [8].\n\n3.2 Detailed design and discussion\n\nDesign of FishNet for handling the gradient propagation problem. With the body and head\ndesigned in the FishNet, the features from all stages at the tail and body are concatenated at the head.\nWe carefully designed the layers in the head so that there is no I-conv in it. The layers in the head are\ncomposed of concatenation, convolution with identity mapping, and max-pooling. Therefore, the\ngradient propagation problem from the previous backbone network in the tail is solved with the\nFishNet by 1) excluding I-conv at the head; and 2) using concatenation at the body and the head.\nSelection of up/down-sampling function. The kernel size is set as 2 \u00d7 2 for down-sampling with\nstride 2 to avoid the overlapping between pixels. Ablation studies will show the effect of different\nkinds of kernel sizes in the network. To avoid the problem from I-conv, the weighted de-convolution\nin up-sampling method should be avoided. For simplicity, we choose nearest neighbor interpolation\nfor up-sampling. Since the up-sampling operation will dilute input features with lower resolution, we\napply dilated convolution in the refining blocks.\nBridge module between the fish body and tail. As the tail part will down sample the features into\nresolution 1 \u00d7 1, these 1 \u00d7 1 features need to be upsampled to 7 \u00d7 7. 
We apply a SE-block [11] here\nto map the feature from 1 \u00d7 1 into 7 \u00d7 7 using a channel-wise attentive operation.\n\nFigure 4: The comparison of the classification top-1 (top-5) error rates as a function of the number of\nparameters (left) and FLOP (right) for FishNet, DenseNet and ResNet (single-crop testing) on the\nvalidation set of ImageNet.\n\nTable 1: ImageNet-1k val Top-1 error for the ResNeXt-based architectures. The 4d here for FishNeXt-150 (4d) indicates that the minimum number of channels for a single group is 4.\n\nMethod | Params | Top-1 Error\nResNeXt-50 (32 \u00d7 4d) | 25.0M | 22.2%\nFishNeXt-150 (4d) | 26.2M | 21.5%\n\nTable 2: ImageNet-1k val Top-1 error for different down-sampling methods based on FishNet-150.\n\nMethod | Params | Top-1 Error\nMax-Pooling (3 \u00d7 3, stride=2) | 26.4M | 22.51%\nMax-Pooling (2 \u00d7 2, stride=2) | 26.4M | 21.93%\nAvg-Pooling (2 \u00d7 2, stride=2) | 26.4M | 22.86%\nConvolution (stride=2) | 30.2M | 22.75%\n\n4 Experiments and Results\n\n4.1 Implementation details on image classification\n\nFor image classification, we evaluate our network on the ImageNet 2012 classification dataset [25]\nthat consists of 1000 classes. This dataset has 1.2 million images for training, and 50,000 images for\nvalidation (denoted by ImageNet-1k val). We implement the FishNet based on the prevalent deep\nlearning framework PyTorch [23]. For training, we randomly crop the images into the resolution\nof 224 \u00d7 224 with batch size 256, and choose stochastic gradient descent (SGD) as the training\noptimizer with the base learning rate set to 0.1. The weight decay and momentum are $10^{-4}$ and 0.9\nrespectively. We train the network for 100 epochs, and the learning rate is decreased by 10 times\nevery 30 epochs. 
The normalization process is done by first converting the value of each pixel into the\ninterval [0, 1], and then subtracting the mean and dividing by the variance for each channel of the RGB\nrespectively. We follow the way of augmentation (random crop, horizontal flip and standard color\naugmentation [17]) used in [9] for fair comparison. All the experiments in this paper are evaluated\nthrough single-crop validation process on the validation dataset of ImageNet-1k. Specifically, an\nimage region of size 224 \u00d7 224 is cropped from the center of an input image with its shorter side\nbeing resized to 256. This 224 \u00d7 224 image region is the input of the network.\nFishNet is a framework. It does not specify the building block. For the experimental results in\nthis paper, FishNet uses the Residual block with identity mapping [8] as the basic building block,\nand FishNeXt uses the Residual block with identity mapping and grouping [29] as the building block.\n\n4.2 Experimental results on ImageNet\n\nFigure 4 shows the top-1 error for ResNet, DenseNet, and FishNet as a function of the number of\nparameters on the validation dataset of ImageNet-1k. When our network uses pre-activation ResNet\nas the tail part of the FishNet, the FishNet performs better than ResNet and DenseNet.\nFishNet vs. ResNet. For fair comparison, we re-implement the ResNet and report the result of\nResNet-50 and ResNet-101 in Figure 4. Our reported single-crop result for ResNet-50 and ResNet-101\nwith identity mapping is higher than that in [9] as we select the residual block with pre-activation\nto be our basic building block. Compared to ResNet, FishNet achieves a remarkable reduction in error\nrate. The FishNet-150 (21.93%, 26.4M), for which the number of parameters is close to ResNet-50\n(23.78%, 25.5M), is able to surpass the performance of ResNet-101 (22.30%, 44.5M). 
In terms of\nFLOPs, as shown in the right figure of Figure 4, the FishNet is also able to achieve better performance\nwith lower FLOPs compared with the ResNet.\n\nTable 3: MS COCO val-2017 detection and segmentation Average Precision (%) for different methods.\n$AP^s_*$ and $AP^d_*$ denote the average precision for segmentation and detection respectively. $AP^*_S$, $AP^*_M$,\nand $AP^*_L$ respectively denote the AP for the small, medium and large objects. The backbone networks\nare used for two different segmentation and detection approaches, i.e. Mask R-CNN [10] and FPN\n[21]. The model re-implemented by us is denoted by a symbol \u2020. FishNet-150 does not use grouping,\nand the number of parameters for FishNet-150 is close to that of ResNet-50 and ResNeXt-50.\n\nBackbone | Instance Segmentation, Mask R-CNN ($AP^s$/$AP^s_S$/$AP^s_M$/$AP^s_L$) | Object Detection, Mask R-CNN ($AP^d$/$AP^d_S$/$AP^d_M$/$AP^d_L$) | Object Detection, FPN ($AP^d$/$AP^d_S$/$AP^d_M$/$AP^d_L$)\nResNet-50 [5] | 34.5/15.6/37.1/52.1 | 38.6/22.2/41.5/50.8 | 37.9/21.5/41.1/49.9\nResNet-50\u2020 | 34.7/18.5/37.4/47.7 | 38.7/22.3/42.0/51.2 | 38.0/21.4/41.6/50.1\nResNeXt-50 (32x4d)\u2020 | 35.7/19.1/38.5/48.5 | 40.0/23.1/43.0/52.8 | 39.3/23.2/42.3/51.7\nFishNet-150 | 37.0/19.8/40.2/50.3 | 41.5/24.1/44.9/55.0 | 40.6/23.3/43.9/53.7\nvs. ResNet-50\u2020 | +2.3/+1.3/+2.8/+2.6 | +2.8/+1.8/+2.9/+3.8 | +2.6/+1.9/+2.3/+3.6\nvs. ResNeXt-50\u2020 | +1.3/+0.7/+1.7/+1.8 | +1.5/+1.0/+1.9/+2.2 | +1.3/+0.1/+1.6/+2.0\n\nFishNet vs. DenseNet. 
DenseNet iteratively aggregates the features with the same resolution\nby concatenation and then reduces the dimension between each dense-block by a transition layer.\nAccording to the results in Figure 4, DenseNet is able to surpass the accuracy of ResNet using\nfewer parameters. Since FishNet preserves features with more diversity and better handles the\ngradient propagation problem, FishNet is able to achieve better performance than DenseNet with\nfewer parameters. Besides, the memory cost of the FishNet is also lower than that of the DenseNet. Taking\nthe FishNet-150 as an example, when the batch size on a single GPU is 32, the memory cost of\nFishNet-150 is 6505M, which is 2764M smaller than the cost of DenseNet-161 (9269M).\nFishNeXt vs. ResNeXt. The architecture of FishNet could be combined with other kinds of designs,\ne.g., the channel-wise grouping adopted by ResNeXt. We follow the criterion that the number of\nchannels in a group for each block (UR/DR block and transfer block) of the same stage should be the\nsame. The width of a single group will be doubled once the stage index increases by 1. In this way, the\nResNet-based FishNet could be constructed into a ResNeXt-based network, namely FishNeXt. We\nconstruct a compact model FishNeXt-150 with 26 million parameters. The number of parameters\nfor FishNeXt-150 is close to ResNeXt-50. From Table 1, the absolute top-1 error rate can be reduced\nby 0.7% when compared with the corresponding ResNeXt architecture.\n\n4.3 Ablation studies\n\nPooling vs. convolution with stride. We investigated four kinds of down-sampling methods based\non the network FishNet-150, including convolution, max-pooling with the kernel size of 2 \u00d7 2 and\n3 \u00d7 3, and average pooling with kernel size 2 \u00d7 2 (see footnote 1). As shown in Table 2, the performance of applying\n2 \u00d7 2 max-pooling is better than the other methods. 
Strided convolution hinders the loss from\ndirectly propagating the gradient to the shallow layers, while pooling does not. We also find that\nmax-pooling with kernel size 3 \u00d7 3 performs worse than size 2 \u00d7 2, as the structural information\nmight be disturbed by the max-pooling with the 3 \u00d7 3 kernel, which has overlapping pooling windows.\nDilated convolution. Yu et al. [32] found that the loss of spatial acuity may limit\nthe accuracy of image classification. In FishNet, the UR-block dilutes the original low-resolution\nfeatures; therefore, we adopt dilated convolution in the fish body. When dilated kernels are used at\nthe fish body for up-sampling, the absolute top-1 error rate is reduced by 0.13% based on FishNet-150.\nHowever, there is a 0.1% absolute error rate increase if dilated convolution is used in both the fish body\nand head compared to the model without any dilation introduced. Besides, we replace the first 7 \u00d7 7\nstrided convolution layer with two residual blocks, which reduces the absolute top-1 error by 0.18%.\n\nFootnote 1: When convolution with a stride of 2 is used, it is used for both the tail and the head of the FishNet. When\npooling is used, we still put a 1 \u00d7 1 convolution on the skip connection of the last residual blocks for each stage\nat the tail to change the number of channels between two stages, but we do not use such convolution at the head.\n\n4.4 Experimental investigations on MS COCO\n\nWe evaluate the generalization capability of FishNet on object detection and instance segmentation\non MS COCO [20]. For fair comparison, all models implemented by ourselves use the same settings\nexcept for the network backbone. 
All the code used to produce the object detection and instance segmentation results reported in this paper\nis released at [1].\nDataset and Metrics. MS COCO [20] is one of the most challenging datasets for object detection and\ninstance segmentation. There are 80 classes with bounding box annotations and pixel-wise instance\nmask annotations. It consists of 118k images for training (train-2017) and 5k images for validation\n(val-2017). We train our models on train-2017 and report results on val-2017. We evaluate\nall models with the standard COCO evaluation metrics AP (averaged mean Average Precision over\ndifferent IoU thresholds) [10], and AP_S, AP_M, AP_L (AP at different scales).\nImplementation Details. We re-implement the Feature Pyramid Networks (FPN) and Mask R-CNN\nbased on PyTorch [23], and report the re-implemented results in Table 3. Our re-implemented results\nare close to the results reported in Detectron [5]. With FishNet, we trained all networks on 16 GPUs\nwith batch size 16 (one per GPU) for 32 epochs. SGD is used as the training optimizer with a learning\nrate of 0.02, which is decreased by a factor of 10 at the 20th and 28th epochs. As the mini-batch size is small,\nthe batch-normalization layers [14] in our network are all fixed during the whole training process.\nA warming-up training process [6] is applied for 1 epoch and the gradients are clipped below a\nmaximum threshold of 5.0 in the first 2 epochs to handle the huge gradients during the initial\ntraining stage. The weights of the convolution on the resolution of 224 \u00d7 224 are all fixed. We use a\nweight decay of 0.0001 and a momentum of 0.9. The networks are trained and tested in an end-to-end\nmanner. All other hyper-parameters used in experiments follow those in [5].\nObject Detection Results Based on FPN. We report the results of detection using FPN with FishNet-\n150 on val-2017 for comparison. 
The top-down pathway and lateral connections in FPN are attached\nto the fish head. As shown in Table 3, FishNet-150 obtains a 2.6% absolute AP increase over\nResNet-50, and a 1.3% absolute AP increase over ResNeXt-50.\nInstance Segmentation and Object Detection Results Based on Mask R-CNN. Similar to the\nmethod adopted in FPN, we also plug FishNet into Mask R-CNN for simultaneous segmentation and\ndetection. As shown in Table 3, for the task of instance segmentation, 2.3% and 1.3% absolute AP\ngains are achieved compared to ResNet-50 and ResNeXt-50, respectively. Moreover, when the network is\ntrained in such a multi-task fashion, the performance of object detection could be even better. With\nFishNet plugged into Mask R-CNN, 2.8% and 1.5% improvements in absolute AP are\nobserved compared to ResNet-50 and ResNeXt-50, respectively.\nNote that FishNet-150 does not use channel-wise grouping, and the number of parameters for\nFishNet-150 is close to that of ResNet-50 and ResNeXt-50. When compared with ResNeXt-50,\nFishNet-150 only reduces the absolute error rate by 0.2% for image classification, while it improves the\nabsolute AP by 1.3% (FPN) and 1.5% (Mask R-CNN) for object detection and by 1.3% for instance segmentation (Table 3). This\nshows that FishNet provides features that are more effective for the region-level task of object\ndetection and the pixel-level task of segmentation.\nCOCO Detection Challenge 2018. FishNet was used as one of the network backbones of the\nwinning entry. By embedding FishNet into our framework, the single model FishNeXt-229\nachieves 43.3% on the task of instance segmentation on the test-dev set.\n\n5 Conclusion\nIn this paper, we propose a novel CNN architecture to unify the advantages of architectures designed\nfor tasks that recognize objects at different levels. 
The design of feature preservation and refinement\nnot only helps to handle the problem of direct gradient propagation, but is also friendly to pixel-level\nand region-level tasks. Experimental results have demonstrated and validated the improvement of\nour network. In future work, we will investigate more detailed settings of our network, e.g., the\nnumber of channels/blocks for each stage, and also the integration with other network architectures.\nThe performance for larger models on both datasets will also be reported.\nAcknowledgement. We would like to thank Guo Lu and Olly Styles for their careful proofreading.\nWe also appreciate Mr. Hui Zhou at SenseTime Research, whose broad network\nbrought the authors of this paper together.\n\nReferences\n\n[1] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin. mmdetection. https://github.com/open-mmlab/mmdetection, 2018.\n\n[2] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In Advances in Neural Information Processing Systems, pages 4470\u20134478, 2017.\n\n[3] X. Chu, W. Ouyang, X. Wang, et al. CRF-CNN: Modeling structured information in human pose estimation. In Advances in Neural Information Processing Systems, pages 316\u2013324, 2016.\n\n[4] P. Gao, H. Li, S. Li, P. Lu, Y. Li, S. C. Hoi, and X. Wang. Question-guided hybrid convolution for visual question answering. arXiv preprint arXiv:1808.02632, 2018.\n\n[5] R. Girshick, I. Radosavovic, G. Gkioxari, P. Doll\u00e1r, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.\n\n[6] P. Goyal, P. Doll\u00e1r, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.\n\n[7] B. Hariharan, P. Arbel\u00e1ez, R. Girshick, and J. Malik. 
Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447\u2013456, 2015.\n\n[8] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630\u2013645. Springer, 2016.\n\n[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770\u2013778, 2016.\n\n[10] K. He, G. Gkioxari, P. Doll\u00e1r, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980\u20132988. IEEE, 2017.\n\n[11] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.\n\n[12] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense convolutional networks for efficient prediction. arXiv preprint arXiv:1703.09844, 2017.\n\n[13] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.\n\n[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[15] J.-H. Jacobsen, A. Smeulders, and E. Oyallon. i-RevNet: Deep invertible networks. arXiv preprint arXiv:1802.07088, 2018.\n\n[16] E. Kim, C. Ahn, and S. Oh. NestedNet: Learning nested sparse structures in deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8669\u20138678, 2018.\n\n[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n\n[18] G. Larsson, M. Maire, and G. Shakhnarovich. 
FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.\n\n[19] H. Li, Y. Liu, W. Ouyang, and X. Wang. Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision, Jun 2018. ISSN 1573-1405. doi: 10.1007/s11263-018-1101-7. URL https://doi.org/10.1007/s11263-018-1101-7.\n\n[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740\u2013755. Springer, 2014.\n\n[21] T.-Y. Lin, P. Doll\u00e1r, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.\n\n[22] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483\u2013499. Springer, 2016.\n\n[23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.\n\n[24] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234\u2013241. Springer, 2015.\n\n[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[27] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang. Optical flow guided feature: A fast and robust motion representation for video action recognition. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1390\u20131399, 2018.\n\n[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In CVPR, 2015.\n\n[29] S. Xie, R. Girshick, P. Doll\u00e1r, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987\u20135995. IEEE, 2017.\n\n[30] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. arXiv preprint arXiv:1708.01101, 2017.\n\n[31] Y. Yang, Z. Zhong, T. Shen, and Z. Lin. Convolutional neural networks with alternately updated clique. arXiv preprint arXiv:1802.10419, 2018.\n\n[32] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition, volume 1, 2017.\n\n[33] F. Yu, D. Wang, and T. Darrell. Deep layer aggregation. arXiv preprint arXiv:1707.06484, 2017.\n\n[34] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.\n\n[35] X. Zeng, W. Ouyang, B. Yang, J. Yan, and X. Wang. Gated bi-directional CNN for object detection. In European Conference on Computer Vision, pages 354\u2013369. Springer, 2016.\n\n[36] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529\u20131537, 2015.\n\n[37] H. Zhou, W. Ouyang, J. Cheng, X. Wang, and H. Li. Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking. 
IEEE Transactions on Circuits and Systems for Video Technology, 2018.\n", "award": [], "sourceid": 429, "authors": [{"given_name": "Shuyang", "family_name": "Sun", "institution": "The University of Sydney"}, {"given_name": "Jiangmiao", "family_name": "Pang", "institution": "Zhejiang University"}, {"given_name": "Jianping", "family_name": "Shi", "institution": "Sensetime Group Limited"}, {"given_name": "Shuai", "family_name": "Yi", "institution": "SenseTime Group Limited"}, {"given_name": "Wanli", "family_name": "Ouyang", "institution": "The University of Sydney"}]}