{"title": "Neural Diffusion Distance for Image Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 1443, "page_last": 1453, "abstract": "Diffusion distance is a spectral method for measuring distance among nodes on graph considering global data structure. In this work, we propose a spec-diff-net for computing diffusion distance on graph based on approximate spectral decomposition. The network is a differentiable deep architecture consisting of feature extraction and diffusion distance modules for computing diffusion distance on image by end-to-end training. We design low resolution kernel matching loss and high resolution segment matching loss to enforce the network's output to be consistent with human-labeled image segments. To compute high-resolution diffusion distance or segmentation mask, we design an up-sampling strategy by feature-attentional interpolation which can be learned when training spec-diff-net. With the learned diffusion distance, we propose a hierarchical image segmentation method outperforming previous segmentation methods. Moreover, a weakly supervised semantic segmentation network is designed using diffusion distance and achieved promising results on PASCAL VOC 2012 segmentation dataset.", "full_text": "Neural Diffusion Distance for Image Segmentation\n\nJian Sun and Zongben Xu\n\nSchool of Mathematics and Statistics\nXi\u2019an Jiaotong University, P. R. China\n\n{jiansun,zbxu}@xjtu.edu.cn\n\nAbstract\n\nDiffusion distance is a spectral method for measuring distance among nodes on\ngraph considering global data structure. In this work, we propose a spec-diff-net\nfor computing diffusion distance on graph based on approximate spectral decom-\nposition. The network is a differentiable deep architecture consisting of feature\nextraction and diffusion distance modules for computing diffusion distance on\nimage by end-to-end training. 
We design low resolution kernel matching loss\nand high resolution segment matching loss to enforce the network\u2019s output to\nbe consistent with human-labeled image segments. To compute high-resolution\ndiffusion distance or segmentation mask, we design an up-sampling strategy by\nfeature-attentional interpolation which can be learned when training spec-diff-net.\nWith the learned diffusion distance, we propose a hierarchical image segmentation\nmethod outperforming previous segmentation methods. Moreover, a weakly su-\npervised semantic segmentation network is designed using diffusion distance and\nachieved promising results on PASCAL VOC 2012 segmentation dataset.\n\n1\n\nIntroduction\n\nSpectral analysis is a popular technique for diverse applications in computer vision and machine\nlearning, such as semi-supervised learning on graph [39], image segmentation [17, 31], image\nmatting [21], 3D shape analysis [36], etc. Spectral clustering and diffusion distance are two typical\nspectral techniques that rely on af\ufb01nity matrix over a graph. By decomposing the af\ufb01nity matrix\nusing spectral decomposition, the corresponding eigenvectors encode the global structure of data, and\ncan be utilized for spectral clustering, diffusion distance computation, image segmentation, etc.\nComputing af\ufb01nity matrix on graph for identifying the relations of each node w.r.t. other nodes\nis a fundamental task with potential applications in image segmentation [31], interactive image\nlabeling [11] , object semantic segmentation [18, 22], video recognition [35], etc. 
Traditionally, the\naf\ufb01nity matrix is either based on hand-crafted features [11, 31] or directly computed based on pairwise\nfeature similarity of graph nodes without considering global structure of underlying graph [35, 37].\nIn this work, we propose neural diffusion distance (NDD) on image inspired by diffusion distance [7,\n8], which is a spectral method for computing pairwise distance considering global data structure by\nspectral analysis. We propose to compute neural diffusion distance on image using a novel deep\narchitecture, dubbed as spec-diff-net. This network consists of a feature extraction module, and a\ndiffusion distance module including the computations of probabilistic transition matrix, spectral\ndecomposition and diffusion distance, in an end-to-end trainable system.\nTo enable computation of spectral decomposition in an ef\ufb01cient and differentiable way, we use\nsimultaneous iteration [12, 32] for approximating the eigen-decomposition of transition matrix. Since\nthe neural diffusion distance is computed on the feature grid with lower resolution than full image,\nwe propose a learnable up-sampling strategy in spec-diff-net using feature-attentional interpolation\nfor interpolating diffusion distance or segmentation map. The spec-diff-net is trained to constrain that\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fits output neural diffusion distance should be consistent with human-labeled segmentation masks\nusing Berkeley segmentation dataset (BSD) [28].\nWe apply neural diffusion distance to two segmentation tasks, i.e., hierarchical image segmentation\nand weakly supervised semantic segmentation. For the \ufb01rst task, we design a hierarchical clustering\nalgorithm based on NDD, achieving signi\ufb01cantly higher segmentation accuracy. 
For the second\ntask, with the NDD as guidance, we propose an attention module using regional feature pooling for\nweakly supervised semantic segmentation. It achieves state-of-the-art semantic segmentation results\non PASCAL VOC 2012 segmentation dataset [23] in weakly supervised setting.\nOur contributions can be summarized as follows. First, a novel neural diffusion distance and its deep\narchitecture were proposed. Second, with neural diffusion distance, we designed a novel hierarchical\nclustering method and a weakly supervised semantic segmentation method, achieving state-of-the-art\nperformance for image segmentation. Moreover, though we learn NDD on image, it can also be\npotentially applied to general data graph beyond image, deserving investigation in the future.\n\n2 Related works\n\nTraditional spectral clustering [26] or diffusion distance [25] rely on hand-crafted features for\nconstructing af\ufb01nity matrix. In [11], diffusion distance was computed based on color and textures.\nIt was taken as the spatial range for applying image editing. In [1], a learning-based method was\nproposed for spectral clustering by de\ufb01ning a novel cost function differentiable to the af\ufb01nity matrix.\nRecently, spectral analysis was combined with deep learning. Spectral network [3] is a pioneering\nnetwork extending conventional CNN on grid to graph by de\ufb01ning convolution using spectral\ndecomposition of graph Laplacian. The af\ufb01nity matrix and its spectral decomposition are pre-\ncomputed. Diffusion net [24] is de\ufb01ned as an auto-encoder for manifold learning. The encoding\nprocedure maps high-dimensional dataset into a low dimensional embedding space approximating\ndiffusion maps, and the decoder maps from embedding space back to data space. Similarly, [2, 30]\nlearn a mapping from data to its eigen-space of graph Laplacian matrix, then cluster the data by\nspectral clustering. 
The af\ufb01nity matrix is separately learned by a siamese network in [30]. These\nnetworks were applied to toy datasets for data clustering. The most similar work to ours is [17], in\nwhich an end-to-end learned spectral clustering algorithm was proposed based on subspace alignment\ncost which is differentiable to feature extractor using gradients of SVD / eigen-decomposition. This\ndeep spectral network was successfully applied to natural image segmentation.\nAnother category of related research is deep embedding methods that directly measure the distance /\nsimilarity of pixels in the deep embedded feature space [4, 5, 6, 14, 19]. For example, [5, 6] learned\nthe embedding feature space and relied on metric learning to learn similarity of paired pixels for\nvideo segmentation. Compared with them, our neural diffusion distance also works in embedded\nfeature space, but measures pixel distance by diffusion on graph in a concept of diffusion distance,\nand distances are computed in the eigen-space of transition matrix (i.e., diffusion maps). This results\nin more smooth and continuous diffusion distance maps for image, as will be shown in experiments.\nOur proposed neural diffusion distance bridges diffusion distance and deep learning in an effective\nway. Compared with traditional diffusion distance [7, 8, 25], NDD is based on an end-to-end\ntrainable deep architecture with learned features and hyper-parameters. Compared with (deep)\nspectral clustering [17, 26], our segmentation method is built based on NDD considering global\nimage structure when measuring af\ufb01nity of image pixels. As shown in experiments, NDD enables\nstate-of-the-art results for image segmentation and weakly supervised semantic segmentation.\n\n3 Diffusion map and diffusion distance\n\nWe \ufb01rst brie\ufb02y introduce the basic theory of diffusion distance [7, 8, 11] on a graph. Given a graph\nG = (V, E) with N nodes V = {v1, v2,\u00b7\u00b7\u00b7 , vN} and edge set E. 
Assume that f_i is the feature vector of node i (i = 1, 2, ..., N). We first define the similarity matrix W with each element w_{ij} given by

w_{ij} = exp(-mu ||f_i - f_j||_2^2),  for j in S_N(i),    (1)

where S_N(i) is the neighborhood set of i. Then the probabilistic transition matrix P can be derived by normalizing each row of W:

P = D^{-1} W,  where D = diag(W 1).    (2)

Figure 1: The spec-diff-net consists of a feature extraction module, followed by a diffusion distance module, successively computing the transition matrix, approximate spectral decomposition and diffusion distance. It is trained using the HR segment matching loss and the LR kernel matching loss.

Each element P_{ij} of P is the probability of a random walker moving from node i to node j, and the (i, j)-th element of P^t reflects the probability of moving from node i to node j in t time steps. The diffusion distance D_t(i, j) is defined as the sum of squared differences between the probabilities that random walkers starting from the two nodes i and j end up at the same node of the graph at time t:

D_t(i, j) = sum_k (p(k, t | i) - p(k, t | j))^2 w~(k),    (3)

where p(k, t | i) is the probability that a random walk starting from node i ends up at node k in t time steps, and w~(k) is the reciprocal of the local density at node k. The diffusion distance is small if a large number of short paths connect the two points. Moreover, as t increases, the diffusion distance between two nodes decreases. The diffusion distance considers the global data structure and is more robust to noise than geodesic distance [7].

Suppose that P has a set of N eigenvalues {lambda_m}_{m=0}^{N-1} in decreasing order, with corresponding eigenvectors Phi_0, ..., Phi_{N-1}. When the graph has non-zero connections between each pair of nodes, the eigenvalues satisfy 1 = lambda_0 >= lambda_1 >= ... >= lambda_{N-1}. Then the diffusion distance is

D_t(i, j) = sum_{m=0}^{N-1} lambda_m^{2t} (Phi_m(i) - Phi_m(j))^2,    (4)

which is the Euclidean distance in the embedded space spanned by the diffusion maps lambda_0^t Phi_0, ..., lambda_{N-1}^t Phi_{N-1}.

4 Learning neural diffusion distance on image

We next design a deep architecture, dubbed spec-diff-net, to compute diffusion distance by concatenating feature extraction and diffusion distance computation in a single pipeline.

4.1 Network architecture

As shown in Fig. 1, given an input image I, spec-diff-net successively processes the image by a feature extraction module and a diffusion distance module consisting of computations of the transition matrix, eigen-decomposition and diffusion distance. Its output is called neural diffusion distance, which is sent to the training loss for end-to-end training.

Feature extraction module. For extracting features from image I, this module consists of repetitions of convolution, ReLU and max-pooling layers. We denote it by f(I; Theta) with network parameters Theta; its output is a feature map F in R^{w x h x d}, which can be reshaped to R^{N x d} (N = w x h).

Diffusion distance module. Based on features F, this module first computes the transition matrix P = D^{-1} W with w_{ij} = exp(-mu ||f_i - f_j||^2), where f_i is the feature of i. Then it computes the eigen-decomposition of P as discussed in Sect. 4.2. With Lambda = {lambda_1, ..., lambda_N} and Phi denoting the eigenvalues and the matrix of eigenvectors, the diffusion distance between i and j on the feature grid can be computed by Eq. 
(4).

4.2 Approximation of spectral decomposition

An essential component in spec-diff-net is the spectral decomposition of the transition matrix P in R^{N x N}, whose complexity is commonly O(N^3). For better scaling to larger N, we design a differentiable approximation of spectral decomposition based on the simultaneous iteration algorithm [12, 32], an extension of power iteration that approximately computes a set of N_e dominant eigenvalues and eigenvectors of a matrix. The algorithm initializes the N_e dominant eigenvectors by a matrix U_0 of size N x N_e, then iteratively runs

Z_{n+1} = P U_n,  {U_{n+1}, R_{n+1}} = QR(Z_{n+1}),  n = 0, ..., T,    (5)

where QR stands for QR-decomposition. It can be proved that, as n goes to infinity, U_n and the diagonal values of R_n respectively approximate the N_e dominant eigenvectors and the corresponding eigenvalues.

Figure 2: Neural diffusion similarity maps of image pixels indicated by red dots. In (a), the middle and right images are neural diffusion similarity w/o and with feature-attentional interpolation (FAI).

As shown in Eq. (4), we aim to compute the eigenvectors together with the powered eigenvalues lambda^{2t} of P. We therefore utilize the simultaneous iteration algorithm to compute the spectral decomposition of P^{2t}, i.e., we substitute P^{2t} for P in Eq. (5). The following proposition shows that this simple revision (we call it accelerated simultaneous iteration) improves the convergence rate.

Proposition 1. 
Assume the eigenvalues of P satisfy lambda_0 > lambda_1 > ... > lambda_{N_e - 1} > lambda_{N_e}, and that all leading principal sub-matrices of Gamma^T U_0 (Gamma is the matrix with columns Phi_1, ..., Phi_{N_e}) are non-singular. Then the columns of U_n converge to the top N_e eigenvectors at the linear rate (max_{k in [1, N_e]} {|lambda_k| / |lambda_{k-1}|})^{2t}, and the diagonal values of R_n converge to the corresponding top N_e eigenvalues lambda_0^{2t}, ..., lambda_{N_e - 1}^{2t} at the same rate.

Please see the supplementary material for the proof. By approximating the spectral decomposition of P^{2t} instead of P, the convergence rate is improved from the linear rate max_{k in [1, N_e]} {|lambda_k| / |lambda_{k-1}|} to max_{k in [1, N_e]} {(|lambda_k| / |lambda_{k-1}|)^{2t}} if t > 0.5. Since the computational complexity of QR decomposition is O(N_e N^2), that of simultaneous iteration is O(T N_e N^2). As discussed later, we only retain the top N_e << N (N_e = 50) eigenvalues and truncate the iterations at T (T = 2); therefore the complexity O(T N_e N^2) is smaller than that of the original eigen-decomposition, O(N^3), when N is large.

4.3 Up-sampling by feature-attentional interpolation

Since the diffusion distance is computed on the feature grid of F, which has lower resolution than the input image, we design an interpolation method to up-sample the diffusion distance map (or segmentation map). The feature extractor in spec-diff-net can output multi-scale features F^0, ..., F^L at its intermediate layers, with feature grids Omega^0, ..., Omega^L from high resolution to low resolution. We interpolate a map y^L from the coarsest to the finest level step by step. 
Suppose we already have the map y^l at level l; we interpolate it to the finer level l - 1 by feature-attentional interpolation:

y_i^{l-1} = (1 / Z_i^{l-1}) sum_{j in Omega~^l ∩ Sat(i)} exp(-gamma ||f_i^{l-1} - f_j^{l-1}||^2) y_j^l,  i in Omega^{l-1},    (6)

where Z_i^{l-1} = sum_{j in Omega~^l ∩ Sat(i)} exp(-gamma ||f_i^{l-1} - f_j^{l-1}||^2) is the normalization factor, Sat(i) is a region neighboring pixel i, Omega~^l is the grid obtained by up-scaling the grid coordinates of Omega^l to the finer coordinate system of Omega^{l-1}, j in Omega~^l ∩ Sat(i) is a point of Omega~^l neighboring i at the (l - 1)-th level, and f_j^{l-1} is its corresponding feature, bi-linearly interpolated if j is not at integer coordinates. In this way, each pixel of the up-sampled map y^{l-1} is a weighted combination of the values of its neighboring pixels up-sampled from the lower-resolution grid, with the weights computed from feature similarity. All computations are differentiable and are incorporated into spec-diff-net as discussed in Sect. 5.

5 Network training for learning neural diffusion distance

We train spec-diff-net by enforcing its output, i.e., neural diffusion distance, to be consistent with human-labeled segmentations in the training set. Please see Fig. 2 for examples of learned neural diffusion distance (similarity). We define two training losses to learn neural diffusion distance.

Low-resolution (LR) kernel matching loss. Given the output neural diffusion distance matrix D_t, whose elements measure the diffusion distance of paired pixels, we first transform it to the neural diffusion similarity matrix K_D = exp(-tau D_t). 
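As a concrete reference, the pipeline from features to the similarity matrix K_D (Eqs. (1), (2), (5) applied to P^{2t}, and (4)) can be sketched in NumPy as below. This is a minimal sketch: the function name, the dense all-pairs neighborhood, and the hyper-parameter values are illustrative stand-ins, not the trained network's settings.

```python
import numpy as np

def neural_diffusion_similarity(feats, mu=1.0, t=10, n_eig=8, n_iter=2, tau=1.0):
    """Sketch: features -> W (Eq. 1) -> P (Eq. 2) -> accelerated simultaneous
    iteration on P^(2t) (Eq. 5) -> diffusion distance D_t (Eq. 4) -> K_D."""
    n = feats.shape[0]
    # Eq. (1): pairwise similarity (a dense neighborhood, for simplicity)
    sq_dist = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    w = np.exp(-mu * sq_dist)
    # Eq. (2): row-normalized transition matrix P = D^{-1} W
    p = w / w.sum(axis=1, keepdims=True)
    # Eq. (5) applied to P^(2t): accelerated simultaneous iteration
    p2t = np.linalg.matrix_power(p, 2 * t)
    u = np.eye(n)[:, :n_eig]                  # U_0: one-hot initialization
    r = np.eye(n_eig)
    for _ in range(n_iter):
        u, r = np.linalg.qr(p2t @ u)          # {U_{n+1}, R_{n+1}} = QR(P^{2t} U_n)
    lam_2t = np.abs(np.diag(r))               # approximates lambda_m^{2t}
    # Eq. (4): D_t(i, j) = sum_m lambda_m^{2t} (Phi_m(i) - Phi_m(j))^2
    diff = u[:, None, :] - u[None, :, :]
    d_t = (lam_2t[None, None, :] * diff ** 2).sum(-1)
    return np.exp(-tau * d_t)                 # K_D = exp(-tau * D_t)
```

In the paper, U_0's one-hot columns are spread uniformly over the feature grid and N_e = 50; a handful of eigenpairs on a toy graph suffices to show the mechanics.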
Then this loss enforces that K_D, measuring the similarities of paired pixels on the low-resolution feature grid, should be consistent with K_gt defined by the human-labeled segmentation, i.e., the (i, j)-th element of K_gt is 1 if i, j are in the same segment, and zero otherwise. We define the LR kernel matching loss as

L_lr(K_D, K_gt) = -< K_D / ||K_D||_F, K_gt / ||K_gt||_F >.    (7)

High-resolution (HR) segment matching loss. We define the neural diffusion similarity map of pixel i as the i-th row of K_D (denoted K_D^i), measuring the similarities of i with the remaining pixels. We enforce that the neural diffusion similarity map of each pixel i is consistent with the labeled segmentation mask at image resolution. To reduce training overhead, we randomly select a pixel set S including one sample for each segment in the human-labeled segmentation; the high-resolution segment matching loss is then

L_hr(K_D, K^_gt) = sum_{i in S} -< K^_D^i / ||K^_D^i||, K^_gt^i / ||K^_gt^i|| >,    (8)

where K^_gt is the ground-truth human-labeled similarity matrix at image resolution, K^_D^i = UpSample(K_D^i), and "UpSample" denotes the feature-attentional interpolation discussed in Sect. 4.3. We use three scales of features, with 1/2, 1/4, 1/8 factors of the input image width and height, for interpolation; these features are the outputs of conv1, conv2, conv5 of ResNet-101 [15]. K_D, K_gt, K^_gt all have elements in [0, 1] and ones on their diagonals, so it is easy to verify that L_lr and L_hr are minimized when their two arguments, i.e., the similarity matrices, are exactly the same.

Training details. The spec-diff-net is a deep architecture with differentiable building blocks. We train it on the BSD500 dataset [28] by auto-differentiation; each image has multiple human-labeled boundaries. From these boundaries, each image can be segmented into regions. 
Compared with semantic segmentation labels, the segmentation labels of BSD500 do not indicate a semantic categorization of pixels; they only indicate that pixels in a segment are grouped based on human observation. To speed up the training process, we first pre-train our spec-diff-net using the LR kernel matching loss, then add the HR segment matching loss, which is more computationally expensive due to the up-sampling by feature-attentional interpolation. We use ResNet-101 (excluding the classification layer) pre-trained on MS-COCO [33], as in [20], for feature extraction, and train spec-diff-net for 160000 steps. Since the components of spec-diff-net are differentiable, we learn the parameters Theta of the feature extractor, mu, t, gamma in Eqs. (1, 4, 6), and tau in K_D. We empirically found that the eigenvalues of the transition matrix P decrease fast from the maximal value of one; we therefore set N_e = 50 in the approximation of spectral decomposition to cover the dominant spectrum. U_0 in the simultaneous iteration is initialized by N_e one-hot columns with ones uniformly located on the feature grid. The neighborhood width when computing W in Eq. (1) is set to 17 on the feature grid. It takes 0.2 seconds to output the neural diffusion distance for an image of size 321 x 481 on a GeForce GTX TITAN X GPU.

Illustration of diffusion distance. Figure 2 illustrates examples of learned diffusion similarity maps with respect to the image pixels indicated by red points. Figure 2(a) shows that feature-attentional interpolation can up-sample neural diffusion similarity maps without aliasing artifacts. We also tried a siamese network with the same ResNet-101 backbone as ours to learn pairwise similarity in an embedded feature space (denoted "Embedding"); our neural diffusion distance is smooth and continuous compared with the "Embedding" method.

Effects of parameters in approximate spectral decomposition. 
Table 1 presents training (300 images in "train + val" of the BSD500 dataset) and test (200 images in "test" of the BSD500 dataset) accuracies, measured by the cosine similarity of the estimated neural diffusion similarity matrix K_D with the target similarity matrix K_gt, for different hyper-parameters T and initialized t in the approximate spectral decomposition. Note that simultaneous iteration serves as a differentiable computational block in spec-diff-net, which is trained end-to-end to minimize the final training loss. We observe that increasing the initialization of t from 5 to 10 and the iterations T from 1 to 2 both increase training and test accuracies, which saturate when further increasing T and the initialized t. In the following, we set T = 2 and initialize t = 10.

Table 1: Effects of different parameters in approximate spectral decomposition.

(T, t)     (1,5)   (1,10)  (2,5)   (2,10)  (3,5)   (3,10)
Train+val  0.778   0.785   0.785   0.794   0.777   0.785
Test       0.701   0.709   0.738   0.741   0.735   0.748

Figure 3: Visual comparison of similarity maps between the deep embedding method and our neural diffusion distance. Each map shows the similarities w.r.t. the central pixel in the image.

6 Application to hierarchical image segmentation

We first apply neural diffusion distance to image segmentation. We train spec-diff-net on the BSD500 "train" and "val" sets, and test it on the "test" set. Given a test image I, K_D is its neural diffusion similarity matrix measuring neural diffusion similarity between pairs of grid points. With K_D, we design a hierarchical clustering algorithm for hierarchical image segmentation. The basic idea is to first identify a set of cluster centers, and then run the kernel k-means algorithm [9] with K_D as the kernel to produce a finest segmentation of the image. Then we gradually aggregate these segments to derive a hierarchy of image segmentations. 
To initialize the cluster centers, we iteratively add a new cluster center whose diffusion similarity map best covers the residual coverage map 1 - U_cov, with U_cov in R^{N x 1} initialized as zeros. Specifically, we iteratively add a cluster center by

i* = argmax_i { K_D^i (1 - U_cov) },  C = C ∪ {i*},  U_cov = min{ U_cov + K_D^{i*}, 1 },    (9)

where K_D^i is the i-th row of K_D, i.e., the diffusion similarity map of i, and C is the set of cluster centers. The iteration stops when the residual coverage map is smaller than a threshold (0.02) on average over pixels. After segmenting image I into a set of segments with these initial centers by kernel k-means, we iteratively aggregate the segments, merging the pair of segments with the largest average feature similarity at each step, until a single cluster covers the whole image. In this way, we generate a hierarchy of segmentations with a decreasing number of segments.

Figure 4: Comparison of image segmentation results. (a) illustrates hierarchical image segmentation with decreasing number of segments. (b) compares segmentation results by different methods.

Table 2: Comparison of different segmentation methods.

Methods  NCut [31]  NCut-DF  DeepNCut [17]  Ours-LR  Ours-HR
MAX      0.53       0.56     0.70           0.78     0.80
AVG      0.44       0.48     0.60           0.68     0.69

In Fig. 4, we illustrate an example of hierarchical image segmentation (Fig. 4(a)) and comparisons with other segmentation methods, including normalized cut [31] using deep features (NCut-DF), deep normalized clustering (DeepNCut) [17], and our methods w/o (Ours-LR) and with (Ours-HR) feature-attentional interpolation of segmentation masks. The quantitative comparisons are shown in Tab. 2. 
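As an illustrative sketch of the clustering procedure above (assuming K_D is given as a dense NumPy array), the greedy center selection of Eq. (9) might look like the following; the simple highest-similarity assignment stands in for the full kernel k-means step [9], and all names are ours, not the paper's.

```python
import numpy as np

def select_centers(K, thresh=0.02):
    """Greedy center initialization of Eq. (9): repeatedly add the pixel whose
    diffusion similarity map best covers the residual coverage map 1 - U_cov."""
    n = K.shape[0]
    u_cov = np.zeros(n)
    centers = []
    while (1.0 - u_cov).mean() > thresh:
        scores = K @ (1.0 - u_cov)           # K_D^i (1 - U_cov) for every i
        i_star = int(np.argmax(scores))
        if i_star in centers:                # coverage can no longer grow
            break
        centers.append(i_star)
        u_cov = np.minimum(u_cov + K[i_star], 1.0)
    return centers

def assign_pixels(K, centers):
    """Simplified assignment: each pixel joins its most similar center
    (a stand-in for the kernel k-means refinement used in the paper)."""
    return np.argmax(K[:, centers], axis=1)
```

On a toy block-structured K_D with two groups, the loop picks one center per group and the assignment recovers the two segments.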
Accuracy is measured by the average (AVG) and best (MAX) covering metrics under the optimal image scale criterion [28], as in [17]. Our algorithm achieves significantly better accuracies on the "test" set of BSD500. For example, DeepNCut is a state-of-the-art deep spectral segmentation method based on differentiable eigen-decomposition, and our method achieves nearly 0.1 higher accuracy.

7 Application to weakly supervised semantic segmentation

We also apply neural diffusion distance to weakly supervised semantic segmentation, i.e., learning to segment given an image set with only image-level classification labels. The basic idea is as follows. Neural diffusion distance determines the similarities of each pixel w.r.t. the other pixels on the feature grid, which can be taken as spatial guidance for localizing the object of interest in a weakly supervised setting. Overall, we combine segmentation and classification in a single network, and train the network using only class labels. This is achieved by designing an attention module, guided by diffusion distance, that generates "pseudo" segmentation maps, which are utilized for computing global image features by weighted average pooling with weights based on the "pseudo" segmentation. The global image features are taken as input to the training loss to predict image class labels.

Figure 5: The architecture of our weakly supervised segmentation network.

As shown in Fig. 5, given image I, we compute the neural diffusion distance and similarity matrix K_D in R^{N x N} by spec-diff-net. We also use ResNet-101 to extract features F in R^{N x d} from I. Then we design an attention module using regional feature pooling (RFP) to generate pseudo segmentation probability maps P^sg in R^{N x c} (c is the number of classes). With the pseudo segmentation maps, we compute per-category global features F^gl by per-category weighted average pooling (PC-WAP) of F. 
Then the global features F^gl are sent to the training loss to predict image labels. We next introduce these components.

Regional feature pooling (RFP). This performs average feature pooling, for each pixel, over a region determined by diffusion distance. We first generate a binary spatial regional mask for each pixel on the feature grid, implemented in parallel for all pixels by thresholding the diffusion similarity matrix K_D as M = delta[K_D > mu] in R^{N x N} (mu is initialized as 0.5, and delta[.] is 1 if its argument is true and 0 otherwise). Then we average-pool features over the regional mask of each pixel, implemented as F_M = diag((M 1)^{-1}) M F, F_M in R^{N x d}. Therefore, for each pixel, this operation pools the features over the region of surrounding pixels whose neural diffusion similarities to it are larger than mu.

Pseudo segmentation prediction (PRED). With the features pooled by RFP, we predict the per-pixel segmentation probabilities by a classifier {H in R^{d x c}, b in R^{c x 1}}, i.e.,

P^sg = Softmax_cl(F_M H + 1 b^T),    (10)

where P^sg in R^{N x c}, 1 is the all-ones vector, and Softmax_cl(.) is the softmax across different categories. Therefore, the i-th column of P^sg indicates the probability map of pixels belonging to the i-th category.

Per-category weighted average pooling (PC-WAP). Based on the "pseudo" segmentation probability maps in P^sg, we compute the global image feature for the i-th category by weighted average pooling:

F_i^gl = F^T [Softmax_sp(P_i^sg; theta_i)],  for i = 1, ..., c,    (11)

where P_i^sg in R^{N x 1} is the i-th column of P^sg, and Softmax_sp(P_i^sg; theta_i) in R^{N x 1} is the softmax operator applied spatially over the feature grid with temperature theta_i. Different from global average pooling (GAP) in [38], we compute the global image feature by weighted average pooling with weights based on the "pseudo" segmentation probability maps in P^sg, indicating which pixels are relevant to each class.

Training loss. In the weakly supervised setting, we only have image-level class labels; we therefore design the training loss only with the guidance of class labels. Given the globally pooled features from PC-WAP, we predict the probabilities of the image belonging to different categories (the "CLAS" block in Fig. 5) by P^cl = {H_i^T F_i^gl + b_i}_{i=1}^c, where H_i and b_i (i = 1, ..., c) are respectively one column and one element of H, b in the "PRED" block. Then the training loss is defined by binary cross-entropy (BCE):

L_ws = BCE(P^cl, y^cl) + BCE(P^sg_max, y^cl),    (12)

where P^sg_max in R^c is a vector whose elements are the maximal values of the columns of P^sg over the feature grid for the different categories; the second term is therefore a multiple instance loss. 
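Putting Eqs. (10)-(12) together, a minimal NumPy sketch of the forward pass might read as follows; the sigmoid inside the BCE and the way the temperature enters the spatial softmax are our assumptions for a runnable toy, not details fixed by the paper.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weak_sup_forward(F_M, F, H, b, theta=1.0):
    """Eq. (10): class softmax over RFP-pooled features; Eq. (11): per-category
    spatially softmaxed attention pooling; plus the max term of Eq. (12)."""
    p_sg = softmax(F_M @ H + b[None, :], axis=1)     # N x c pseudo segmentation
    att = softmax(theta * p_sg, axis=0)              # spatial softmax per category
    f_gl = F.T @ att                                 # d x c per-category global features
    c = H.shape[1]
    p_cl = np.array([H[:, i] @ f_gl[:, i] + b[i] for i in range(c)])  # class scores
    return p_sg, p_cl, p_sg.max(axis=0)              # last term feeds the MIL loss

def bce(scores, y, from_logits=True):
    """Binary cross-entropy of Eq. (12); the sigmoid on scores is an assumption."""
    p = 1.0 / (1.0 + np.exp(-scores)) if from_logits else scores
    p = np.clip(p, 1e-8, 1.0 - 1e-8)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())
```

L_ws would then be bce(p_cl, y) + bce(p_sg_max, y, from_logits=False) for an image label vector y in {0, 1}^c.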
Minimizing L_ws forces the classifier {H, b} to predict correct image-level labels and, implicitly, pixel-level segmentation.

Table 3: Comparison of different weakly supervised semantic segmentation methods.

Methods  MIL [29]  Saliency [27]  RegGrow [16]  RandWalk [34]  AISI [10]  Ours
Val      42.0      55.7           59.0          59.5           63.6       65.8
Test     -         56.7           -             -              64.5       66.3

Table 4: Comparison with baseline semantic segmentation methods.

Methods  GAP [38]  Embedding  Ours (w/o RFP)  Ours (w/o sharing)  Ours
Val      45.2      54.7       44.6            64.7                65.8

We train the weakly supervised segmentation network (spec-diff-net is fixed and pre-trained on 500 images of BSD500) on the VOC 2012 segmentation training set with augmented data [13], using only image labels. After training the network, we derive pseudo segmentation maps for the training images, which are taken as segmentation labels for training another ResNet-101 for learning to segment. We train the nets on 321 x 321 patches with fixed batch normalization, as in the pre-trained ResNet-101, due to the limited batch size. We apply the trained segmentation net to "val" and "test" of the VOC 2012 segmentation dataset. The network is applied to a test image at multiple scales (scaling factors of 0.7, 0.85, 1) with cropped overlapping 321 x 321 patches, and the segmentation probabilities are averaged as the final prediction.

Table 3 compares segmentation accuracies in mIoU with other weakly supervised segmentation methods: multiple instance learning (MIL) [29], a saliency-based method (Saliency) [27], a region growing method (RegGrow) [16], a random-walk method (RandWalk) [34], and a salient instance-based method (AISI) [10]. Note that the RandWalk method [34] is based on random walks for label propagation given human-labeled scribbles. 
AISI [10] depends on an instance-level salient object detector trained on the MS COCO dataset. We achieve 65.8% and 66.3% on the “val” and “test” sets, higher than the state-of-the-art AISI method, which also uses ResNet-101 and the same training set. Figure 6 shows examples of segmentation results (more results are in the supplementary material).

Figure 6: Examples of semantic segmentation results by different methods.

Ablation study: As shown in Tab. 4, without regional feature pooling, i.e., ours (w/o RFP), the accuracy on the “val” set decreases from 65.8 to 44.6. This shows that RFP is essential because it enforces that pixels with high neural diffusion similarities have similar features, so that they are grouped together and receive similar segmentation probabilities. Furthermore, not sharing the classifiers for classification in the training loss and for segmentation in the “PRED” module marginally decreases the result. When sharing classifiers, optimizing the training loss jointly enforces that the same classifier predicts the global image class label and the locations of objects of interest. In Tab. 4, we also report the result of the same weakly supervised segmentation architecture as ours but with similarity learned by an embedding method; its accuracy is significantly lower than that of our method based on diffusion distance.

8 Conclusion and future work

In this work, we proposed a novel deep architecture for computing neural diffusion distance on images based on approximate spectral decomposition and feature-attentional interpolation. It achieved promising results for hierarchical image segmentation and weakly supervised semantic segmentation. We are interested in further improving the neural diffusion distance, e.g., better handling transparent object boundaries, and in applying it to more applications, e.g., image colorization, editing, and labeling.

Acknowledgement.
This work was supported by National Natural Science Foundation of China under Grants 11971373, 11622106, 11690011, 61721002, U1811461.

References

[1] Francis R Bach and Michael I Jordan. Learning spectral clustering. In NeurIPS, pages 305–312, 2004.

[2] Matt Barnes and Artur Dubrawski. Deep spectral clustering for object instance segmentation. In ICLR Workshop, 2018.

[3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In ICLR, 2014.

[4] Siddhartha Chandra, Nicolas Usunier, and Iasonas Kokkinos. Dense and low-rank Gaussian CRFs using deep embeddings. In ICCV, pages 5103–5112, 2017.

[5] Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, and Luc Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR, pages 1189–1198, 2018.

[Figure panels: (a) examples of “pseudo” segmentation probability maps by our methods w/o (middle) and with (right) regional feature pooling; (b) inputs; (c) GAP; (d) ours (w/o RFP); (e) ours; (f) GT]

[6] Hai Ci, Chunyu Wang, and Yizhou Wang. Video object segmentation by learning location-sensitive embeddings. In ECCV, pages 501–516, 2018.

[7] Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[8] Ronald R Coifman, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426–7431, 2005.

[9] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In ACM SIGKDD, pages 551–556, 2004.

[10] Ruochen Fan, Qibin Hou, Ming-Ming Cheng, Gang Yu, Ralph R Martin, and Shi-Min Hu.
Associating inter-image salient instances for weakly supervised semantic segmentation. In ECCV, 2018.

[11] Zeev Farbman, Raanan Fattal, and Dani Lischinski. Diffusion maps for edge-aware image editing. ACM Transactions on Graphics (TOG), 29(6):145, 2010.

[12] John GF Francis. The QR transformation, part 2. The Computer Journal, 4(4):332–345, 1962.

[13] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.

[14] Adam W Harley, Konstantinos G Derpanis, and Iasonas Kokkinos. Segmentation-aware convolutional networks using local attention masks. In ICCV, pages 5038–5047, 2017.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[16] Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, and Jingdong Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, 2018.

[17] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix backpropagation for deep networks with structured layers. In ICCV, pages 2965–2973, 2015.

[18] Peng Jiang, Fanglin Gu, Yunhai Wang, Changhe Tu, and Baoquan Chen. DifNet: Semantic segmentation by diffusion networks. In NeurIPS, pages 1637–1646, 2018.

[19] Shu Kong and Charless C Fowlkes. Recurrent pixel embedding for instance grouping. In CVPR, pages 9018–9028, 2018.

[20] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

[21] Anat Levin, Alex Rav-Acha, and Dani Lischinski. Spectral matting.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1699–1712, 2008.

[22] Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, and Jan Kautz. Learning affinity via spatial propagation networks. In NeurIPS, pages 1520–1530, 2017.

[23] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.

[24] Gal Mishne, Uri Shaham, Alexander Cloninger, and Israel Cohen. Diffusion nets. Applied and Computational Harmonic Analysis, 2017.

[25] Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In NeurIPS, pages 955–962, 2006.

[26] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NeurIPS, pages 849–856, 2002.

[27] Seong Joon Oh, Rodrigo Benenson, Anna Khoreva, Zeynep Akata, Mario Fritz, and Bernt Schiele. Exploiting saliency for object segmentation from image level labels. In CVPR, 2017.

[28] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.

[29] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.

[30] Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. In ICLR, 2018.

[31] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[32] Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50.
SIAM, 1997.

[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[34] Paul Vernaza and Manmohan Chandraker. Learning random-walk label propagation for weakly-supervised semantic segmentation. In CVPR, 2017.

[35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.

[36] Ruixuan Yu, Jian Sun, and Huibin Li. Learning spectral transform network on 3d surface for non-rigid shape analysis. In ECCV-GMDL, pages 377–394. Springer, 2018.

[37] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.

[38] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921–2929, 2016.

[39] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.