{"title": "Learning Deep Structured Multi-Scale Features using Attention-Gated CRFs for Contour Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 3961, "page_last": 3970, "abstract": "Recent works have shown that exploiting multi-scale representations deeply learned via convolutional neural networks (CNN) is of tremendous importance for accurate contour detection. This paper presents a novel approach for predicting contours which advances the state of the art in two fundamental aspects, i.e. multi-scale feature generation and fusion. Different from previous works that directly consider multi-scale feature maps obtained from the inner layers of a primary CNN architecture, we introduce a hierarchical deep model which produces richer and more complementary representations. Furthermore, to refine and robustly fuse the representations learned at different scales, the novel Attention-Gated Conditional Random Fields (AG-CRFs) are proposed. The experiments run on two publicly available datasets (BSDS500 and NYUDv2) demonstrate the effectiveness of the latent AG-CRF model and of the overall hierarchical framework.", "full_text": "Learning Deep Structured Multi-Scale Features using\n\nAttention-Gated CRFs for Contour Prediction\n\nDan Xu1 Wanli Ouyang2 Xavier Alameda-Pineda3 Elisa Ricci4\n\nXiaogang Wang5 Nicu Sebe1\n\n1The University of Trento, 2The University of Sydney, 3Perception Group, INRIA\n\n4University of Perugia, 5The Chinese University of Hong Kong\n\ndan.xu@unitn.it, wanli.ouyang@sydney.edu.au, xavier.alameda-pineda@inria.fr\n\nelisa.ricci@unipg.it, xgwang@ee.cuhk.edu.hk, niculae.sebe@unitn.it\n\nAbstract\n\nRecent works have shown that exploiting multi-scale representations deeply learned via convolutional neural networks (CNN) is of tremendous importance for accurate contour detection. This paper presents a novel approach for predicting contours which advances the state of the art in two fundamental aspects, i.e. 
multi-scale feature generation and fusion. Different from previous works that directly consider multi-scale feature maps obtained from the inner layers of a primary CNN architecture, we introduce a hierarchical deep model which produces richer and more complementary representations. Furthermore, to refine and robustly fuse the representations learned at different scales, the novel Attention-Gated Conditional Random Fields (AG-CRFs) are proposed. The experiments run on two publicly available datasets (BSDS500 and NYUDv2) demonstrate the effectiveness of the latent AG-CRF model and of the overall hierarchical framework.\n\n1\n\nIntroduction\n\nConsidered one of the fundamental tasks in low-level vision, contour detection has been deeply studied in the past decades. While early works mostly focused on low-level cues (e.g. colors, gradients, textures) and hand-crafted features [3, 25, 22], more recent methods benefit from the representational power of deep learning models [31, 2, 38, 19, 24]. The ability to effectively exploit multi-scale feature representations is considered a crucial factor for achieving accurate contour predictions in both traditional [29] and CNN-based [38, 19, 24] approaches. Restricting attention to deep learning-based solutions, existing methods [38, 24] typically derive multi-scale representations by adopting standard CNN architectures and directly considering the feature maps associated with different inner layers. These maps are highly complementary: while the features from the first layers are responsible for predicting fine details, the ones from the higher layers are devoted to encoding the basic structure of the objects. Traditionally, concatenation and weighted averaging are very popular strategies to combine multi-scale representations (see Fig. 1.a). 
While these strategies typically lead to increased detection accuracy with respect to single-scale models, they severely simplify the complex relationship between multi-scale feature maps.\nThe motivational cornerstone of this study is the following research question: is it worth modeling and exploiting complex relationships between multiple scales of a deep representation for contour detection? In order to provide an answer, and inspired by recent works exploiting graphical models within deep learning architectures [5, 39], we introduce Attention-Gated Conditional Random Fields (AG-CRFs), which allow learning robust feature map representations at each scale by exploiting the information available from other scales. This is achieved by incorporating an attention mechanism [27] seamlessly integrated into the multi-scale learning process in the form of gates [26]. Intuitively,\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nthe attention mechanism will further enhance the quality of the learned multi-scale representation, thus improving the overall performance of the model.\nWe integrated the proposed AG-CRFs into a two-level hierarchical CNN model, defining a novel Attention-guided Multi-scale Hierarchical deepNet (AMH-Net) for contour detection. The hierarchical network is able to learn richer multi-scale features than conventional CNNs, and its representational power is further enhanced by the proposed AG-CRF model. We evaluate the effectiveness of the overall model on two publicly available datasets for the contour detection task, i.e. BSDS500 [1] and NYU Depth v2 [33]. The results demonstrate that our approach is able to learn rich and complementary features, thus outperforming state-of-the-art contour detection methods.\nRelated work. In the last few years several deep learning models have been proposed for detecting contours [31, 2, 41, 38, 24, 23]. 
Among these, some works explicitly focused on devising multi-scale CNN models in order to boost performance. For instance, the Holistically-Nested Edge Detection method [38] employed multiple side outputs derived from the inner layers of a primary CNN and combined them for the final prediction. Liu et al. [23] introduced a framework to learn rich deep representations by concatenating features derived from all convolutional layers of VGG16. Bertasius et al. [2] considered skip-layer CNNs to jointly combine feature maps from multiple layers. Maninis et al. [24] proposed Convolutional Oriented Boundaries (COB), where features from different layers are fused to compute oriented contours and region hierarchies. However, these works combine the multi-scale representations from different layers by adopting concatenation and weighted averaging schemes, without considering the dependencies between the features. Furthermore, these works do not focus on generating richer and more diverse representations at each CNN layer.\nThe combination of multi-scale representations has also been widely investigated for other pixel-level prediction tasks, such as semantic segmentation [43], visual saliency detection [21] and monocular depth estimation [39], and different deep architectures have been designed. For instance, to effectively aggregate the multi-scale information, Yu et al. [43] introduced dilated convolutions. Yang et al. [42] proposed DAG-CNNs, where multi-scale feature outputs from different ReLU layers are combined through an element-wise addition operator. However, none of these works incorporate an attention mechanism into a multi-scale structured feature learning framework.\nAttention models have been successfully exploited in deep learning for various tasks such as image classification [37], speech recognition [4] and image caption generation [40]. 
However, to our knowledge, this work is the first to introduce an attention model for estimating contours. Furthermore, we are not aware of previous studies integrating the attention mechanism into a probabilistic (CRF) framework to control the message passing between hidden variables. We model the attention as gates [26], which have been used in previous deep models such as restricted Boltzmann machines for unsupervised feature learning [35], LSTMs for sequence learning [12, 6] and CNNs for image classification [44]. However, none of these works explore the possibility of jointly learning multi-scale deep representations and an attention model within a unified probabilistic graphical model.\n\n2 Attention-Gated CRFs for Deep Structured Multi-Scale Feature Learning\n\n2.1 Problem Definition and Notation\n\nGiven an input image I and a generic front-end CNN model with parameters W_c, we consider a set of S multi-scale feature maps F = \{f_s\}_{s=1}^S. Being a generic framework, these feature maps can be the output of S intermediate CNN layers or of another representation, thus s is a virtual scale. The feature map at scale s, f_s, can be interpreted as a set of feature vectors, f_s = \{f^i_s\}_{i=1}^N, where N is the number of pixels. In contrast to previous works adopting simple concatenation or weighted averaging schemes [16, 38], we propose to combine the multi-scale feature maps by learning a set of latent feature maps h_s = \{h^i_s\}_{i=1}^N with a novel Attention-Gated CRF model sketched in Fig. 1. Intuitively, this allows a joint refinement of the features by flowing information between different scales. Moreover, since the information from one scale may or may not be relevant for the pixels at another scale, we utilise the concept of gate, previously introduced in the literature in the case of graphical models [36], in our CRF formulation. 
These gates are binary random hidden variables that permit or block the flow of information between scales at every pixel. Formally, g^i_{s_e,s_r} \in \{0, 1\} is the gate at pixel i of scale s_r (receiver) from scale s_e (emitter), and we also write g_{s_e,s_r} = \{g^i_{s_e,s_r}\}_{i=1}^N. Precisely, when g^i_{s_e,s_r} = 1 the hidden variable h^i_{s_r} is updated taking (also) into account the information from the s_e-th layer, i.e. h_{s_e}.\n\n2\n\nFigure 1: An illustration of different schemes for multi-scale deep feature learning and fusion. (a) the traditional approach (e.g. concatenation, weighted average), (b) a CRF implementing multi-scale feature fusion, (c) the proposed AG-CRF-based approach.\nAs shown in the following, the joint inference of the hidden features and the gates leads to estimating the optimal features as well as the corresponding attention model, hence the name Attention-Gated CRFs.\n\n2.2 Attention-Gated CRFs\n\nGiven the observed multi-scale feature maps F of image I, the objective is to estimate the hidden multi-scale representation H = \{h_s\}_{s=1}^S and, optionally, the attention gate variables G = \{g_{s_e,s_r}\}_{s_e,s_r=1}^S. To do that, we formalize the problem within a conditional random field framework and write the Gibbs distribution as P(H, G | I, \Theta) = \exp(-E(H, G, I, \Theta)) / Z(I, \Theta), where \Theta is the set of parameters and E is the energy function. As usual, we exploit both unary and binary potentials to couple the hidden variables with each other and with the observations. Importantly, the proposed binary potential is gated, and thus only active when the gate is open. 
More formally, the general form1 of the energy function writes:\n\nE(H, G, I, \Theta) = \sum_s \sum_i \phi_h(h^i_s, f^i_s) + \sum_{s_e,s_r} \sum_{i,j} g^i_{s_e,s_r} \psi_h(h^i_{s_r}, h^j_{s_e}),    (1)\n\nwhere the first sum is the unary potential and the second is the gated pairwise potential.\nThe first term of the energy function is a classical unary term that relates the hidden features to the observed multi-scale CNN representations. The second term synthesizes the theoretical contribution of the present study because it conditions the effect of the pair-wise potential \psi_h(h^i_{s_e}, h^j_{s_r}) upon the gate hidden variable g^i_{s_e,s_r}. Fig. 1c depicts the model formulated in Equ. (1). If we remove the attention gate variables, it becomes a general multi-scale CRF as shown in Fig. 1b.\nGiven that formulation, and as is typically the case in conditional random fields, we exploit the mean-field approximation in order to derive a tractable inference procedure. Under this generic form, the mean-field inference procedure writes:\n\nq(h^i_s) \propto \exp\big( \phi_h(h^i_s, f^i_s) + \sum_{s' \neq s} \sum_j E_{q(g^i_{s',s})}\{g^i_{s',s}\} E_{q(h^j_{s'})}\{\psi_h(h^i_s, h^j_{s'})\} \big),    (2)\n\nq(g^i_{s',s}) \propto \exp\big( g^i_{s',s} E_{q(h^i_s)}\big\{ \sum_j E_{q(h^j_{s'})}\{\psi_h(h^i_s, h^j_{s'})\} \big\} \big),    (3)\n\nwhere E_q stands for the expectation with respect to the distribution q.\nBefore deriving these formulae for our precise choice of potentials, we remark that, since the gate is a binary variable, the expectation of its value is the same as q(g^i_{s',s} = 1). 
By defining M^i_{s',s} = E_{q(h^i_s)}\big\{ \sum_j E_{q(h^j_{s'})}\{\psi_h(h^i_s, h^j_{s'})\} \big\}, the expected value of the gate writes:\n\n\alpha^i_{s,s'} = E_{q(g^i_{s',s})}\{g^i_{s',s}\} = \frac{q(g^i_{s',s} = 1)}{q(g^i_{s',s} = 0) + q(g^i_{s',s} = 1)} = \sigma(M^i_{s',s}) = \frac{1}{1 + \exp(-M^i_{s',s})},    (4)\n\nwhere \sigma() denotes the sigmoid function. This finding is specially relevant in the framework of CNNs, since many attention models are typically obtained after applying the sigmoid function to the features derived from a feed-forward network.\n\n1One could certainly include a unary potential for the gate variables as well. However, this would imply that there is a way to set/learn the a priori distribution of opening/closing a gate. In practice we did not observe any notable difference between using or skipping the unary potential on g.\n\n3\n\n
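As a concrete illustration of Eq. (4), the sketch below (the function name and the scalar interface are ours, not from the paper's code) shows that normalizing the two unnormalized gate states recovers the sigmoid:

```python
import math

def gate_expectation(m):
    """Expected value of a binary gate with open-state score m = M^i_{s',s}.

    From Eq. (3): q(g = 1) is proportional to exp(m) and q(g = 0) to exp(0) = 1,
    so the expectation in Eq. (4) reduces to the sigmoid of m.
    """
    q_open = math.exp(m)   # unnormalized q(g = 1)
    q_closed = 1.0         # unnormalized q(g = 0)
    return q_open / (q_closed + q_open)
```

For m = 0 the gate is maximally uncertain (expectation 0.5); a large positive message opens the gate and a large negative one closes it, which is exactly the behavior expected of a soft attention weight.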
Importantly, since the quantity M^i_{s',s} depends on the expected values of the hidden features h^i_s, the AG-CRF framework extends the unidirectional connection from the features to the attention model into a bidirectional one, in which the expected value of the gate also allows refining the distribution of the hidden features.\n\n2.3 AG-CRF Inference\n\nIn order to construct an operative model we need to define the unary and gated potentials \phi_h and \psi_h. In our case, the unary potential corresponds to an isotropic Gaussian:\n\n\phi_h(h^i_s, f^i_s) = -\frac{a^i_s}{2} \|h^i_s - f^i_s\|^2,    (5)\n\nwhere a^i_s > 0 is a weighting factor.\nThe gated binary potential is specifically designed for a two-fold objective. On the one hand, we would like to learn and further exploit the relationships between hidden vectors at the same, as well as at different, scales. On the other hand, we would like to exploit previous knowledge on attention models and include linear terms in the potential. Indeed, this implicitly shapes the gate variable to include a linear operator on the features. Therefore, we chose a bilinear potential:\n\n\psi_h(h^i_s, h^j_{s'}) = \tilde{h}^{i\top}_s K^{i,j}_{s,s'} \tilde{h}^j_{s'},    (6)\n\nwhere \tilde{h}^i_s = (h^{i\top}_s, 1)^\top and K^{i,j}_{s,s'} \in R^{(C_s+1) \times (C_{s'}+1)}, C_s being the size, i.e. the number of channels, of the representation at scale s. If we write this matrix as K^{i,j}_{s,s'} = (L^{i,j}_{s,s'}, l^{i,j}_{s,s'}; l^{j,i\top}_{s',s}, 1), then L^{i,j}_{s,s'} exploits the relationships between hidden variables, while l^{i,j}_{s,s'} and l^{j,i}_{s',s} implement the linear relationships classically used in attention models. In other words, \psi_h models the pair-wise relationships between features with the upper-left block of the matrix. Furthermore, \psi_h takes into account the linear relationships by completing the hidden vectors with a constant unit entry. 
In all, the energy function writes:\n\nE(H, G, I, \Theta) = -\sum_s \sum_i \frac{a^i_s}{2} \|h^i_s - f^i_s\|^2 + \sum_{s_e,s_r} \sum_{i,j} g^i_{s_e,s_r} \tilde{h}^{i\top}_{s_r} K^{i,j}_{s_r,s_e} \tilde{h}^j_{s_e}.    (7)\n\nUnder these potentials, we can consequently update the mean-field inference equations to:\n\nq(h^i_s) \propto \exp\big( -\frac{a^i_s}{2} (\|h^i_s\|^2 - 2 h^{i\top}_s f^i_s) + \sum_{s' \neq s} \sum_j \alpha^i_{s,s'} h^{i\top}_s (L^{i,j}_{s,s'} \bar{h}^j_{s'} + l^{i,j}_{s,s'}) \big),    (8)\n\nwhere \bar{h}^j_{s'} is the expected a posteriori value of h^j_{s'}.\nThe previous expression implies that the a posteriori distribution for h^i_s is a Gaussian. The mean vector of the Gaussian and the function M write:\n\n\bar{h}^i_s = \frac{1}{a^i_s} \big( a^i_s f^i_s + \sum_{s' \neq s} \sum_j \alpha^i_{s,s'} (L^{i,j}_{s,s'} \bar{h}^j_{s'} + l^{i,j}_{s,s'}) \big),  M^i_{s',s} = \sum_j \big( \bar{h}^{i\top}_s L^{i,j}_{s,s'} \bar{h}^j_{s'} + \bar{h}^{i\top}_s l^{i,j}_{s,s'} + \bar{h}^{j\top}_{s'} l^{j,i}_{s',s} \big),\n\nwhich concludes the inference procedure. Furthermore, the proposed framework can be simplified to obtain the traditional attention models. In most of the previous studies, the attention variables are computed directly from the multi-scale features instead of computing them from the hidden variables. Indeed, since many of these studies do not propose a probabilistic formulation, there are no hidden variables and the attention is computed sequentially through the scales. 
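The coupled updates for the posterior means and the gate expectations of Eq. (4) can be illustrated with a toy mean-field loop. Everything below (function names, the dictionary interface, one feature vector per scale) is a hypothetical sketch under simplifying assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(f, L, l, a=0.1, n_iter=5):
    """Toy mean-field loop for a single pixel shared by a few scales.

    f: dict scale -> observed feature vector f_s (shape (C,))
    L[s][s2], l[s][s2]: pairwise matrix and linear term between scales s and s2
    a: the unary weight a_s, taken constant across scales for brevity
    """
    h = {s: f[s].copy() for s in f}          # initialize the means with the observations
    for _ in range(n_iter):
        for s in f:
            msg = np.zeros_like(f[s])
            for s2 in f:
                if s2 == s:
                    continue
                # message score M (the quadratic plus both linear terms)
                M = h[s] @ L[s][s2] @ h[s2] + h[s] @ l[s][s2] + h[s2] @ l[s2][s]
                alpha = sigmoid(M)           # expected gate, Eq. (4)
                msg += alpha * (L[s][s2] @ h[s2] + l[s][s2])
            h[s] = f[s] + msg / a            # mean of q(h_s): (1/a)(a f + gated messages)
    return h
```

With all pairwise parameters set to zero, every message vanishes and the means stay at the observed features, which matches the unary-only limit of Eq. (8).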
We can emulate the same behavior within the AG-CRF framework by modifying the gated potential as follows:\n\n\tilde{\psi}_h(h^i_s, f^i_s, h^j_{s'}, f^j_{s'}) = h^{i\top}_s L^{i,j}_{s,s'} h^j_{s'} + f^{i\top}_s l^{i,j}_{s,s'} + f^{j\top}_{s'} l^{j,i}_{s',s}.    (9)\n\nThis means that we keep the pair-wise relationships between hidden variables (as in any CRF) and let the attention model be generated by a linear combination of the observed features from the CNN, as is traditionally done. The changes in the inference procedure are straightforward and reported in the supplementary material due to space constraints. We refer to this model as partially-latent AG-CRFs (PLAG-CRFs), whereas the more general one is denoted as fully-latent AG-CRFs (FLAG-CRFs).\n\n2.4 Implementation with neural networks for joint learning\n\nIn order to infer the hidden variables and learn the parameters of the AG-CRFs together with those of the front-end CNN, we implement the AG-CRF updates as a neural network with several steps:\n\n4\n\nFigure 2: An overview of the proposed AMH-Net for contour detection.\n\n(i) Message passing from the s_e-th scale to the current s_r-th scale is performed with h_{s_e \to s_r} \leftarrow L_{s_e \to s_r} \otimes h_{s_e}, where \otimes denotes the convolution operation and L_{s_e \to s_r} denotes the corresponding convolution kernel. (ii) Attention map estimation: q(g_{s_e,s_r} = 1) \leftarrow \sigma(h_{s_r} \odot (L_{s_e \to s_r} \otimes h_{s_e}) + l_{s_e \to s_r} \otimes h_{s_e} + l_{s_r \to s_e} \otimes h_{s_r}), where l_{s_e \to s_r} and l_{s_r \to s_e} are convolution kernels and \odot represents the element-wise product. (iii) Attention-gated message passing from the other scales and addition of the unary term: \bar{h}_{s_r} = f_{s_r} \oplus a_{s_r} \sum_{s_e \neq s_r} (q(g_{s_e,s_r} = 1) \odot h_{s_e \to s_r}), where a_{s_r} encodes the effect of the a^i_{s_r} for weighting the message and can be implemented as a 1 \times 1 convolution, and the symbol \oplus denotes element-wise addition. 
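The three steps above can be sketched for a single emitter-receiver pair as follows. To keep the sketch short, every convolution is reduced to a 1x1 (per-pixel linear) map, a single emitter is used instead of the sum over scales, and all names and shapes are hypothetical assumptions, not the paper's Caffe implementation:

```python
import numpy as np

def ag_crf_message(f_r, h_r, h_e, L_er, l_er, l_re, a=0.1):
    """One attention-gated update of receiver scale r from emitter scale e.

    f_r, h_r, h_e: (C, H, W) feature maps (observation, receiver hidden, emitter hidden)
    L_er, l_er, l_re: (C, C) kernels, i.e. 1x1-convolution weight matrices
    a: scalar standing in for the 1x1 convolution a_{s_r}
    """
    conv1x1 = lambda K, x: np.einsum('dc,chw->dhw', K, x)  # per-pixel linear map
    # (i) message passing from the emitter to the receiver scale
    msg = conv1x1(L_er, h_e)
    # (ii) attention map estimation, q(g_{e,r} = 1), via the sigmoid
    gate = 1.0 / (1.0 + np.exp(-(h_r * msg + conv1x1(l_er, h_e) + conv1x1(l_re, h_r))))
    # (iii) attention-gated message plus the unary term
    return f_r + a * (gate * msg)
```

With several emitters, step (iii) would simply accumulate `gate * msg` over all scales different from the receiver before adding the unary term, mirroring the sum in the update above.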
In order to simplify the overall inference procedure, and because the magnitude of the linear term of \psi_h is in practice negligible compared to the quadratic term, we discard the message associated with the linear term. When the inference is complete, the final estimate is obtained by convolving all the scales.\n\n3 Exploiting AG-CRFs with a Multi-scale Hierarchical Network\n\nAMH-Net Architecture. The proposed Attention-guided Multi-scale Hierarchical Network (AMH-Net), as sketched in Figure 2, consists of a multi-scale hierarchical network (MH-Net) together with the AG-CRF model described above. The MH-Net is constructed from a front-end CNN architecture such as the widely used AlexNet [20], VGG [34] and ResNet [17]. One prominent feature of MH-Net is its ability to generate richer multi-scale representations. In order to do that, we perform distinct non-linear mappings (deconvolution D, convolution C and max-pooling M) upon f_l, the CNN feature representation from an intermediate layer l of the front-end CNN. This leads to a three-way representation: f^D_l, f^C_l and f^M_l. Remarkably, while D upsamples the feature map, C maintains its original size and M reduces it; different kernel sizes are utilized for them so as to have different receptive fields, naturally obtaining complementary inter- and multi-scale representations. The f^C_l and f^M_l are further aligned to the dimensions of the feature map f^D_l by a deconvolutional operation. The hierarchy is implemented in two levels. The first level uses an AG-CRF model to fuse the three representations of each layer l, thus refining the CNN features within the same scale. The second level of the hierarchy uses an AG-CRF model to fuse the information coming from multiple CNN layers. 
The proposed hierarchical multi-scale structure is general purpose and able to involve an arbitrary number of layers and of diverse intra-layer representations.\nEnd-to-End Network Optimization. The parameters of the model consist of the front-end CNN parameters W_c, the parameters W_l used to produce the richer decomposition from each layer l, the parameters of the AG-CRFs of the first level of the hierarchy, \{W^I_l\}_{l=1}^L, and the parameters of the AG-CRFs of the second level of the hierarchy, W^{II}. L is the number of intermediate layers used from the front-end CNN. In order to jointly optimize all these parameters we adopt deep supervision [38] and add an optimization loss associated with each AG-CRF module. In addition, since the contour detection problem is highly unbalanced, i.e. contour pixels are significantly fewer than non-contour pixels, we employ the modified cross-entropy loss function of [38].\n\n5\n\nGiven a training data set D = \{(I_p, E_p)\}_{p=1}^P consisting of P RGB-contour groundtruth pairs, the loss function \ell writes:\n\n\ell(W) = \sum_p \big( \beta \sum_{e^k_p \in E^+_p} \log P(e^k_p = 1 | I_p; W) + (1 - \beta) \sum_{e^k_p \in E^-_p} \log P(e^k_p = 0 | I_p; W) \big),    (10)\n\nwhere \beta = |E^+_p| / (|E^+_p| + |E^-_p|), E^+_p is the set of contour pixels of image p and W is the set of all parameters. The optimization is performed via the back-propagation algorithm with standard stochastic gradient descent.\nAMH-Net for contour detection. After training of the whole AMH-Net, the optimized network parameters W are used for the contour detection task. 
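A minimal sketch of the class-balanced cross-entropy of Eq. (10), for a single image with pixels flattened into lists. The list-based interface and the final negation (turning the weighted log-likelihood into a quantity to be minimized by SGD) are our assumptions for this sketch, not details taken from the paper:

```python
import math

def balanced_contour_loss(probs, labels, eps=1e-12):
    """Class-balanced cross-entropy in the spirit of Eq. (10) / [38].

    probs: predicted P(e = 1) per pixel; labels: 1 for contour, 0 otherwise.
    beta = |E+| / (|E+| + |E-|) weights the two partial sums.
    Assumes both classes are present in the image.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    beta = len(pos) / (len(pos) + len(neg))
    ll = beta * sum(math.log(p + eps) for p in pos) \
       + (1 - beta) * sum(math.log(1 - p + eps) for p in neg)
    return -ll  # negate the log-likelihood to obtain a loss to minimize
```

Perfect predictions drive the loss towards zero, while confidently wrong predictions make it grow, as expected from a cross-entropy objective.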
Given a new test image I, the L + 1 classifiers produce a set of contour prediction maps \{\hat{E}_l\}_{l=1}^{L+1} = AMH-Net(I; W). The \hat{E}_l are obtained from the AG-CRFs with elementary operations, as detailed in the supplementary material. Following [38], we fuse the multiple scale predictions, thus obtaining an average prediction \hat{E} = \sum_l \hat{E}_l / (L + 1).\n\n4 Experiments\n\n4.1 Experimental Setup\n\nDatasets. To evaluate the proposed approach we employ two different benchmarks: the BSDS500 and the NYUDv2 datasets. The BSDS500 dataset is an extended dataset based on BSDS300 [1]. It consists of 200 training, 100 validation and 200 testing images. The groundtruth pixel-level labels for each sample are derived considering multiple annotators. Following [38, 41], we use all the training and validation images for learning the proposed model and perform data augmentation as described in [38]. The NYUDv2 [33] contains 1449 RGB-D images and is split into three subsets, comprising 381 training, 414 validation and 654 testing images. Following [38], in our experiments we employ images at full resolution (i.e. 560 \times 425 pixels) in both the training and the testing phases.\nEvaluation Metrics. During the test phase, standard non-maximum suppression (NMS) [9] is first applied to produce thinned contour maps. We then evaluate the detection performance of our approach according to different metrics, including the F-measure at Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS), and the Average Precision (AP). The maximum tolerance allowed for correct matches of edge predictions to the ground truth is set to 0.0075 for the BSDS500 dataset, and to 0.011 for the NYUDv2 dataset, as in previous works [9, 14, 38].\nImplementation Details. The proposed AMH-Net is implemented under the deep learning framework Caffe [18]. The implementation code is available on Github2. 
The training and testing phases are carried out on an Nvidia Titan X GPU with 12GB of memory. The ResNet50 network pretrained on ImageNet [8] is used to initialize the front-end CNN of AMH-Net. Due to memory constraints, our implementation only considers three scales, i.e. we generate multi-scale features from three different layers of the front-end CNN (i.e. res3d, res4f, res5c). In our CRF model we consider dependencies between all scales. Within the AG-CRFs, the kernel size for all convolutional operations is set to 3 \times 3 with stride 1 and padding 1. To simplify the model optimization, the parameters a^i_{s_r} are set to 0.1 for all scales during training. We chose this value as it corresponds to the best performance after cross-validation in the range [0, 1]. The initial learning rate is set to 1e-7 in all our experiments, and is decreased by a factor of 10 every 10k iterations. The total number of iterations for BSDS500 and NYUDv2 is 40k and 30k, respectively. The momentum and weight decay parameters are set to 0.9 and 0.0002, as in [38]. As the training images have different resolutions, we need to set the batch size to 1, and for the sake of smooth convergence we update the parameters only every 10 iterations.\n\n4.2 Experimental Results\n\nIn this section, we present the results of our evaluation, comparing our approach with several state-of-the-art methods. We further conduct an in-depth analysis of our method, to show the impact of different components on the detection performance.\nComparison with state-of-the-art methods. We first consider the BSDS500 dataset and compare the performance of our approach with several traditional contour detection methods, including Felz-Hutt [11], MeanShift [7], Normalized Cuts [32], ISCRA [30], gPb-ucm [1], SketchTokens [22],\n\n2https://github.com/danxuhk/AttentionGatedMulti-ScaleFeatureLearning\n\n6\n\nFigure 3: Qualitative results on the BSDS500 (left) and the NYUDv2 (right) test samples. 
The 2nd (4th) and 3rd (6th) columns are the ground-truth and estimated contour maps, respectively.\n\nTable 1: BSDS500 dataset: quantitative results.\n\nMethod | ODS | OIS | AP\nHuman | .800 | .800 | -\nFelz-Hutt [11] | .610 | .640 | .560\nMean Shift [7] | .640 | .680 | .560\nNormalized Cuts [32] | .641 | .674 | .447\nISCRA [30] | .724 | .752 | .783\ngPb-ucm [1] | .726 | .760 | .727\nSketch Tokens [22] | .727 | .746 | .780\nMCG [28] | .747 | .779 | .759\nDeepEdge [2] | .753 | .772 | .807\nDeepContour [31] | .756 | .773 | .797\nLEP [46] | .757 | .793 | .828\nHED [38] | .788 | .808 | .840\nCEDN [41] | .788 | .804 | .834\nCOB [24] | .793 | .820 | .859\nRCF [23] (not comp.) | .811 | .830 | -\nAMH-Net (fusion) | .798 | .829 | .869\n\nTable 2: NYUDv2 dataset: quantitative results.\n\nMethod | ODS | OIS | AP\ngPb-ucm [1] | .632 | .661 | .562\nOEF [15] | .651 | .667 | -\nSilberman et al. [33] | .658 | .661 | -\nSemiContour [45] | .680 | .700 | .690\nSE [10] | .685 | .699 | .679\ngPb+NG [13] | .687 | .716 | .629\nSE+NG+ [14] | .710 | .723 | .738\nHED (RGB) [38] | .720 | .734 | .734\nHED (HHA) [38] | .682 | .695 | .702\nHED (RGB+HHA) [38] | .746 | .761 | .786\nRCF (RGB+HHA) [23] | .757 | .771 | -\nAMH-Net (RGB) | .744 | .758 | .765\nAMH-Net (HHA) | .716 | .729 | .734\nAMH-Net (RGB+HHA) | .771 | .786 | .802\n\nMCG [28], LEP [46], and more recent CNN-based methods, including DeepEdge [2], DeepContour [31], HED [38], CEDN [41] and COB [24]. We also report results of the RCF method [23], although they are not comparable because in [23] an extra dataset (Pascal Context) was used during RCF training to improve the results on BSDS500. In this series of experiments we consider AMH-Net with FLAG-CRFs. The results of this comparison are shown in Table 1 and Fig. 4a. AMH-Net obtains an F-measure (ODS) of 0.798, thus outperforming all previous methods. The improvement over the second and third best approaches, i.e. COB and HED, is 0.5% and 1.0%, respectively, which is not trivial to achieve on this challenging dataset. 
Furthermore, when considering the OIS and AP metrics, our approach is also better, with a clear performance gap.\nTo perform experiments on NYUDv2, following previous works [38] we consider three different types of input representations, i.e. RGB, HHA [14] and RGB-HHA data. The results corresponding to the use of both RGB and HHA data (i.e. RGB+HHA) are obtained by performing a weighted average of the estimates obtained from two AMH-Net models trained separately on RGB and HHA representations. As baselines we consider gPb-ucm [1], OEF [15], the method in [33], SemiContour [45], SE [10], gPb+NG [13], SE+NG+ [14], HED [38] and RCF [23]. In this case the results are comparable to RCF [23], since the experimental protocol is exactly the same. All of them are reported in Table 2 and Fig. 4b. Again, our approach outperforms all previous methods. In particular, the increased performance with respect to HED [38] and RCF [23] confirms the benefit of the proposed multi-scale feature learning and fusion scheme. Examples of qualitative results on the BSDS500 and the NYUDv2 datasets are shown in Fig. 3.\nAblation Study. To further demonstrate the effectiveness of the proposed model and analyze the impact of the different components of AMH-Net on the contour detection task, we conduct an ablation study considering the NYUDv2 dataset (RGB data).\n\n7\n\n(a) BSDS500  (b) NYUDv2\n\nFigure 4: Precision-Recall Curves on the BSDS500 and NYUDv2 test sets.\n\nTable 3: Performance analysis on NYUDv2 RGB data.\n\n
We tested the following models: (i) AMH-Net (baseline), which removes the first-level hierarchy and directly concatenates the feature maps for prediction; (ii) AMH-Net (w/o AG-CRFs), which employs the proposed multi-scale hierarchical structure but discards the AG-CRFs; (iii) AMH-Net (w/ CRFs), obtained by replacing our AG-CRFs with a multi-scale CRF model without attention gating; (iv) AMH-Net (w/o deep supervision), obtained by removing the intermediate loss functions in AMH-Net; and (v) AMH-Net with the proposed two versions of the AG-CRF model, i.e. PLAG-CRFs and FLAG-CRFs. The results of our comparison are shown in Table 3, where we also consider as reference traditional multi-scale deep learning models employing multi-scale representations, i.e. Hypercolumn [16] and HED [38].\n\nMethod | ODS | OIS | AP\nHypercolumn [16] | .718 | .729 | .731\nHED [38] | .720 | .734 | .734\nAMH-Net (baseline) | .711 | .720 | .724\nAMH-Net (w/o AG-CRFs) | .722 | .732 | .739\nAMH-Net (w/ CRFs) | .732 | .742 | .750\nAMH-Net (w/o deep supervision) | .725 | .738 | .747\nAMH-Net (w/ PLAG-CRFs) | .737 | .749 | .746\nAMH-Net (w/ FLAG-CRFs) | .744 | .758 | .765\n\nThese results clearly show the advantages of our contributions. The ODS F-measure of AMH-Net (w/o AG-CRFs) is 1.1% higher than that of AMH-Net (baseline), clearly demonstrating the effectiveness of the proposed hierarchical network and confirming our intuition that exploiting richer and more diverse multi-scale representations is beneficial. Table 3 also shows that our AG-CRFs play a fundamental role for accurate detection, as AMH-Net (w/ FLAG-CRFs) leads to an improvement of 1.9% over AMH-Net (w/o AG-CRFs) in terms of ODS. Finally, AMH-Net (w/ FLAG-CRFs) is 1.2% and 1.5% better than AMH-Net (w/ CRFs) in the ODS and AP metrics, respectively, confirming the effectiveness of embedding an attention mechanism in the multi-scale CRF model. AMH-Net (w/o deep supervision) decreases the overall performance of our method by 1.9% in ODS, showing the crucial importance of deep supervision for better optimization of the whole AMH-Net. Comparing the performance of the proposed two versions of the AG-CRF model, i.e. 
PLAG-CRFs and FLAG-CRFs, we can see that AMH-Net (w/ FLAG-CRFs) slightly outperforms AMH-Net (w/ PLAG-CRFs) in both ODS and OIS, while bringing a significant improvement (around 2%) in AP. Finally, considering HED [38] and Hypercolumn [16], it is clear that our AMH-Net (w/ FLAG-CRFs) is significantly better than these methods. Importantly, our approach uses only three scales, while for HED [38] and Hypercolumn [16] we consider five scales. We believe that our accuracy could be further boosted by involving more scales.

Method                           ODS   OIS   AP
Hypercolumn [16]                 .718  .729  .731
HED [38]                         .720  .734  .734
AMH-Net (baseline)               .711  .720  .724
AMH-Net (w/o AG-CRFs)            .722  .732  .739
AMH-Net (w/ CRFs)                .732  .742  .750
AMH-Net (w/o deep supervision)   .725  .738  .747
AMH-Net (w/ PLAG-CRFs)           .737  .749  .746
AMH-Net (w/ FLAG-CRFs)           .744  .758  .765

Figure 4 legends (F-measures): BSDS500 — Human .800, AMH-Net .798, COB .793, CEDN .788, HED .788, LEP .757, DeepContour .756, DeepEdge .753, MCG .747, SketchTokens .727, UCM .726, ISCRA .724, Normalized Cuts .641, MeanShift .640, Felz-Hut .610; NYUDv2 — Human .800, AMH-Net .771, HED .746, SE+NG+ .706, SE .695, gPb+NG .685, SemiContour .680, Silberman .658, OEF .651, gPb-ucm .632.

5 Conclusions

We presented a novel multi-scale convolutional neural network for contour detection. The proposed model introduces two main components, i.e. a hierarchical architecture for generating richer and complementary multi-scale feature representations, and an Attention-Gated CRF model for robust feature refinement and fusion. The effectiveness of our approach is demonstrated through extensive experiments on two publicly available datasets, and state-of-the-art detection performance is achieved.
The proposed approach addresses a general problem, i.e. how to generate rich multi-scale representations and optimally fuse them. Therefore, we believe it may also be useful for other pixel-level prediction tasks.

References
[1] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5), 2011.
[2] G. Bertasius, J. Shi, and L. Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In CVPR, 2015.
[3] J. Canny. A computational approach to edge detection. TPAMI, (6):679–698, 1986.
[4] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In NIPS, 2015.
[5] X. Chu, W. Ouyang, X. Wang, et al. CRF-CNN: Modeling structured information in human pose estimation. In NIPS, 2016.
[6] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[7] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. TPAMI, 24(5), 2002.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
[10] P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. TPAMI, 37(8):1558–1570, 2015.
[11] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2), 2004.
[12] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 3(Aug):115–143, 2002.
[13] S. Gupta, P. Arbeláez, and J. Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR, 2013.
[14] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik.
Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.
[15] S. Hallman and C. C. Fowlkes. Oriented edge forests for boundary detection. In CVPR, 2015.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
[19] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. arXiv preprint arXiv:1511.07386, 2015.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[21] G. Li and Y. Yu. Visual saliency based on multiscale deep features. In CVPR, 2015.
[22] J. J. Lim, C. L. Zitnick, and P. Dollár. Sketch tokens: A learned mid-level representation for contour and object detection. In CVPR, 2013.
[23] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai. Richer convolutional features for edge detection. arXiv preprint arXiv:1612.02103, 2016.
[24] K.-K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. Van Gool. Convolutional oriented boundaries. In ECCV, 2016.
[25] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. TPAMI, 26(5):530–549, 2004.
[26] T. Minka and J. Winn. Gates. In NIPS, 2009.
[27] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In NIPS, 2014.
[28] J. Pont-Tuset, P. Arbeláez, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation.
TPAMI, 2016.
[29] X. Ren. Multi-scale improves boundary detection in natural images. In ECCV, 2008.
[30] Z. Ren and G. Shakhnarovich. Image segmentation by cascaded region agglomeration. In CVPR, 2013.
[31] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In CVPR, 2015.
[32] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8), 2000.
[33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[35] Y. Tang. Gated Boltzmann machine for recognition under occlusion. In NIPS Workshop on Transfer Learning by Learning Rich Generative Models, 2010.
[36] J. Winn. Causality with gates. In AISTATS, 2012.
[37] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, 2015.
[38] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[39] D. Xu, E. Ricci, W. Ouyang, X. Wang, and N. Sebe. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In CVPR, 2017.
[40] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[41] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang. Object contour detection with a fully convolutional encoder-decoder network. In CVPR, 2016.
[42] S. Yang and D. Ramanan. Multi-scale recognition with DAG-CNNs. In ICCV, 2015.
[43] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[44] X. Zeng, W.
Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, et al. Crafting GBD-Net for object detection. arXiv preprint arXiv:1610.02579, 2016.
[45] Z. Zhang, F. Xing, X. Shi, and L. Yang. SemiContour: A semi-supervised learning approach for contour detection. In CVPR, 2016.
[46] Q. Zhao. Segmenting natural images with the least effort as humans. In BMVC, 2015.