{"title": "Matching neural paths: transfer from recognition to correspondence search", "book": "Advances in Neural Information Processing Systems", "page_first": 1205, "page_last": 1214, "abstract": "Many machine learning tasks require finding per-part correspondences between objects. In this work we focus on low-level correspondences --- a highly ambiguous matching problem. We propose to use a hierarchical semantic representation of the objects, coming from a convolutional neural network, to solve this ambiguity. Training it for low-level correspondence prediction directly might not be an option in some domains where the ground-truth correspondences are hard to obtain. We show how transfer from recognition can be used to avoid such training. Our idea is to mark parts as \"matching\" if their features are close to each other at all the levels of convolutional feature hierarchy (neural paths). Although the overall number of such paths is exponential in the number of layers, we propose a polynomial algorithm for aggregating all of them in a single backward pass. The empirical validation is done on the task of stereo correspondence and demonstrates that we achieve competitive results among the methods which do not use labeled target domain data.", "full_text": "Matching neural paths: transfer from recognition to\n\ncorrespondence search\n\nNikolay Savinov1\n\nLubor Ladicky1\n\nMarc Pollefeys1,2\n\n1Department of Computer Science at ETH Zurich, 2Microsoft\n\n{nikolay.savinov,lubor.ladicky,marc.pollefeys}@inf.ethz.ch\n\nAbstract\n\nMany machine learning tasks require \ufb01nding per-part correspondences between\nobjects. In this work we focus on low-level correspondences \u2014 a highly ambiguous\nmatching problem. 
We propose to use a hierarchical semantic representation of\nthe objects, coming from a convolutional neural network, to solve this ambiguity.\nTraining it for low-level correspondence prediction directly might not be an option\nin some domains where the ground-truth correspondences are hard to obtain. We\nshow how transfer from recognition can be used to avoid such training. Our idea is\nto mark parts as \u201cmatching\u201d if their features are close to each other at all the levels\nof convolutional feature hierarchy (neural paths). Although the overall number\nof such paths is exponential in the number of layers, we propose a polynomial\nalgorithm for aggregating all of them in a single backward pass. The empirical\nvalidation is done on the task of stereo correspondence and demonstrates that we\nachieve competitive results among the methods which do not use labeled target\ndomain data.\n\n1\n\nIntroduction\n\nFinding per-part correspondences between objects is a long-standing problem in machine learning.\nThe level at which correspondences are established can go as low as pixels for images or millisecond\ntimestamps for sound signals. Typically, it is highly ambiguous to match at such a low level: a pixel\nor a timestamp just does not contain enough information to be discriminative and many false positives\nwill follow. A hierarchical semantic representation could help to solve the ambiguity: we could\nchoose the low-level match which also matches at the higher levels. For example, a car contains a\nwheel which contains a bolt. If we want to check if this bolt matches the bolt in another view of the\ncar, we should check if the wheel and the car match as well.\nOne possible hierarchical semantic representation could be computed by a convolutional neural\nnetwork. 
The features in such a network are composed in a hierarchical manner: the lower-level\nfeatures are used to compute higher-level features by applying convolutions, max-poolings and\nnon-linear activation functions on them. Nevertheless, training such a convolutional neural network\nfor correspondence prediction directly (e.g., [25], [2]) might not be an option in some domains\nwhere the ground-truth correspondences are hard and expensive to obtain. This raises the question of\nscalability of such approaches and motivates the search for methods which do not require training\ncorrespondence data.\nTo address the training data problem, we could transfer the knowledge from the source domain where\nthe labels are present to the target domain where no labels or few labeled data are present. The most\ncommon form of transfer is from classi\ufb01cation tasks. Its promise is two-fold. First, classi\ufb01cation\nlabels are one of the easiest to obtain as it is a natural task for humans. This allows to create huge\nrecognition datasets like Imagenet [18]. Second, the features from the low to mid-levels have been\nshown to transfer well to a variety of tasks [22], [3], [15].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fAlthough there has been a huge progress in transfer from classi\ufb01cation to detection [7], [17], [19],\n[16], segmentation [12], [1] and other semantic reasoning tasks like single-image depth prediction\n[4], the transfer to correspondence search has been limited [13], [10], [8].\nWe propose a general solution to unsupervised transfer from recognition to correspondence search at\nthe lowest level (pixels, sound millisecond timestamps). Our approach is to match paths of activations\ncoming from a convolutional neural network, applied on two objects to be matched. 
More precisely,\nto establish matching on the lowest level, we require the features to match at all different levels of\nconvolutional feature hierarchy. Those different-level features form paths. One such path would\nconsist of neural activations reachable from the lowest-level feature to the highest-level feature in the\nnetwork topology (in other words, the lowest level feature lies in the receptive \ufb01eld of the highest\nlevel). Since every lowest-level feature belongs to many paths, we do voting based on all of them.\nAlthough the overall number of such paths is exponential in the number of layers and thus infeasible\nto compute naively, we prove that the voting is possible in polynomial time in a single backward\npass through the network. The algorithm is based on dynamic programming and is similar to the\nbackward pass for gradient computation in the neural network.\nEmpirical validation is done on the task of stereo correspondence on two datasets: KITTI 2012 [6]\nand KITTI 2015 [14]. We quantitatively show that our method is competitive among the methods\nwhich do not require labeled target domain data. We also qualitatively show that even dramatic\nchanges in low-level structure can be handled reasonably by our method due to the robustness of the\nrecognition hierarchy: we apply different style transfers [5] to corresponding images in KITTI 2015\nand still successfully \ufb01nd correspondences.\n\n2 Notation\n\nOur method is generally applicable to the cases where the input data has a multi-dimensional grid\ntopology layout. We will assume input objects o to be from the set of B-dimensional grids \u03a6 \u2282 RB\nand run convolutional neural networks on those grids. The per-layer activations from those networks\nwill be contained in the set of (B + 1)-dimensional grids \u03a8 \u2282 RB+1. Both the input data and the\nactivations will be indexed by a (B + 1)-dimensional vector x = (x, y, . . . 
, c) \u2208 N^{B+1}, where x is a\ncolumn index, y is a row index, etc., and c \u2208 {1, . . . , C} is the channel index (we will assume C = 1\nfor the input data, which is a non-restrictive assumption as we will explain later).\nWe will search for correspondences between those grids, thus our goal will be to estimate shifts\nd \u2208 D \u2282 Z^{B+1} for all elements in the grid. The choice of the shift set D is task-dependent. For\nexample, for sound B = 1 and only 1D shifts can be considered. For images, B = 2 and D could be\na set of 1D shifts (usually called a stereo task) or a set of 2D shifts (usually called an optical flow\ntask).\nIn this work, we will be dealing with convolutional neural network architectures, consisting of\nconvolutions, max-poolings and non-linear activation functions (one example of such an architecture\nis a VGG-net [20], if we omit softmax which we will not use for the transfer). We assume every\nconvolutional layer to be followed by a non-linear activation function throughout the paper and will\nnot specify those functions explicitly.\nThe computational graph of these architectures is a directed acyclic graph G = {A, E}, where\nA = {a_1, . . . , a_{|A|}} is a set of nodes, corresponding to neuron activations (|A| denotes the size of\nthis set), and E = {e_1, . . . , e_{|E|}} is a set of arcs, corresponding to computational dependencies (|E|\ndenotes the size of this set). Each arc is represented as a tuple (a_i, a_j), where a_i is the input (origin) and\na_j is the output (endpoint). The node set consists of disjoint layers A = \u22c3_{\u2113=0}^{L} A_\u2113. 
The arcs are only\nallowed to go from the previous layer to the next one.\nWe will use the notation A_\u2113(x) for the node in the \u2113-th layer at position x; in(x_\u2113) for the set of\norigins x_{\u2113\u22121} of arcs entering layer \u2113 at position x_\u2113 of the reference object; and out(x_\u2113) for\nthe set of endpoints x_{\u2113+1} of arcs exiting layer \u2113 at position x_\u2113 of the reference object. Let f_\u2113 \u2208 F =\n{maxpool, conv} be the mathematical operator which corresponds to forward computation in layer\n\u2113 as a \u2190 f_\u2113(in(a)), a \u2208 A_\u2113 (with a slight abuse of notation, we use a for both the nodes in the\ncomputational graph and the activation values which are computed in those nodes).\n\nFigure 1: Four siamese paths are shown. Two of them (red) have the same origin and support the\nhypothesis of the shift d = 3 for this origin. The other two (green and pink) have different origins\nand support hypotheses d = 3 and d = 2 for their respective origins.\n\n3 Correspondence via path matching\n\nWe will consider two objects, reference o \u2208 \u03a6 and searched o' \u2208 \u03a6, for which we want to find\ncorrespondences. After applying a CNN on them, we get graphs G and G' of activations. The goal is\nto establish correspondences between the input-data layers A_0 and A'_0. That is, every cell A_0(x) in\nthe reference object o \u2208 \u03a6 has a certain shift d \u2208 D in the searched object o' \u2208 \u03a6, and we want to\nestimate d.\nHere comes the cornerstone idea of our method: we establish the matching of A_0(x) with A'_0(x \u2212 d)\nfor a shift d if there is a pair of \u201cparallel\u201d paths (we call this pair a siamese path), originating at\nthose nodes and ending at the last layers A_L, A'_L, which match. 
This pair of paths must have the\nsame spatial shift with respect to each other at all layers, up to subsampling, and go through the\nsame feature channels with respect to each other. We take the subsampling into account by per-layer\nfunctions\n\nk_\\ell(d) = \\gamma_\\ell(k_{\\ell-1}(d)), \\quad \\ell = 1, \\dots, L, \\qquad \\gamma_\\ell(\\tilde{d}) = \\left\\lfloor \\frac{\\tilde{d}}{q_\\ell} \\right\\rfloor, \\qquad k_0(d) = d, \\qquad (1)\n\nwhere k_\u2113(d) is how the zero-layer shift d transforms at layer \u2113, q_\u2113 is the \u2113-th layer spatial subsampling\nfactor (note that rounding and division on vectors is done element-wise). Then a siamese path P can\nbe represented as\n\nP = (p, p'), \\quad p = (A_0(x^P_0), \\dots, A_L(x^P_L)), \\quad p' = (A'_0(x^P_0 - k_0(d)), \\dots, A'_L(x^P_L - k_L(d))), \\qquad (2)\n\nwhere x^P_0 = x and x^P_\u2113 denotes the position at which the path P intersects layer \u2113 of the reference\nactivation graph. Such paths are illustrated in Fig. 1. The logic is simple: matching in a siamese\npath means that the recognition hierarchy detects the same features at different perception levels with\nthe same shifts (up to subsampling) with respect to the currently estimated position x, which allows\nfor a confident prediction of match. The fact that a siamese path is \u201cmatched\u201d can be established by\ncomputing the matching function (high if it matches, low if not)\n\nM(P) = \\bigodot_{\\ell=0}^{L} m_\\ell(A_\\ell(x^P_\\ell), A'_\\ell(x^P_\\ell - k_\\ell(d))), \\qquad (3)\n\nwhere m_\u2113(\u00b7,\u00b7) is a matching function for individual neurons (prefers them both to be similar and\nnon-zero at the same time) and \u2299 is a logical-and-like operator. Both will be discussed later.\nSince we want to estimate the shift for a node A_0(x), we will consider all possible shifts and vote for\neach of them. 
Let us denote a set of siamese paths, starting at A_\u2113(x) and A'_\u2113(x \u2212 d) and ending at\nthe last layer, as \\mathcal{P}_\\ell(x, d).\nFor every shift d \u2208 D we introduce U(x, d) as the log-likelihood of the event that d is the correct\nshift, i.e. A_0(x) matches A'_0(x \u2212 d). To collect the evidence from all possible paths, we \u201csum up\u201d\nthe matching functions for all individual paths, leading to\n\nU(x, d) = \\bigoplus_{P \\in \\mathcal{P}_0(x, d)} M(P) = \\bigoplus_{P \\in \\mathcal{P}_0(x, d)} \\bigodot_{\\ell=0}^{L} m_\\ell(A_\\ell(x^P_\\ell), A'_\\ell(x^P_\\ell - k_\\ell(d))), \\qquad (4)\n\nwhere the sum-like operator \u2295 will be discussed later.\nThe distribution U(x, d) can be used to either obtain the solution as d^*(x) = arg max_{d \u2208 D} U(x, d)\nor to post-process the distribution with any kind of spatial smoothing optimization and then again\ntake the best-cost solution.\nThe obvious obstacle to using the distribution U(x, d) is that\nObservation 1. If K is the minimal number of activation channels in all the layers of the network\nand L is the number of layers, the number of paths, considered in the computation of U(x, d) for a\nsingle originating node, is \u2126(K^L), i.e. at least exponential in the number of layers.\n\nIn practice, it is infeasible to compute U(x, d) naively. In this work, we prove that it is possible\nto compute U(x, d) in O(|A| + |E|), thus linear in the number of layers, using the algorithm\nwhich will be introduced in the next section.\n\n4 Linear-time backward algorithm\n\nTheorem 1. For any m_\u2113(\u00b7,\u00b7) and any pair of operators \u27e8\u2295, \u2299\u27e9 such that \u2299 is left-distributive over\n\u2295, i.e. 
a \u2299 (b \u2295 c) = a \u2299 b \u2295 a \u2299 c, we can compute U(x, d) for all x and d in O(|A| + |E|).\nProof. Since there is distributivity, we can use a dynamic programming approach similar to the one\ndeveloped for gradient backpropagation.\nFirst, let us introduce subsampling functions k^\u2113_s(d) = \u03b3_s(k^\u2113_{s\u22121}(d)), k^\u2113_\u2113(d) = d, s \u2265 \u2113. Note that\nk^0_s = k_s as introduced in Eq. 1.\nThen, let us introduce auxiliary variables U_\u2113(x_\u2113, d) for each layer \u2113 = 0, . . . , L, which have the same\ndefinition as U(x, d) except for the fact that the paths, considered in them, start from the later layer \u2113:\n\nU_\\ell(x_\\ell, d) = \\bigoplus_{P \\in \\mathcal{P}_\\ell(x_\\ell, d)} M(P) = \\bigoplus_{P \\in \\mathcal{P}_\\ell(x_\\ell, d)} \\bigodot_{s=\\ell}^{L} m_s(A_s(x^P_s), A'_s(x^P_s - k^\\ell_s(d))). \\qquad (5)\n\nNote that U(x, d) = U_0(x, d). The idea is to iteratively recompute U_\u2113(x_\u2113, d) based on known\nU_{\u2113+1}(x_{\u2113+1}, \u03b3_{\u2113+1}(d)) for all x_{\u2113+1}. Eventually, we will get to the desired U_0(x, d).\nThe first step is to notice that all the paths share the same prefix and write it out explicitly:\n\nU_\\ell(x_\\ell, d) = \\bigoplus_{P \\in \\mathcal{P}_\\ell(x_\\ell, d)} \\bigodot_{s=\\ell}^{L} m_s(A_s(x^P_s), A'_s(x^P_s - k^\\ell_s(d)))\n= \\bigoplus_{P \\in \\mathcal{P}_\\ell(x_\\ell, d)} m_\\ell(A_\\ell(x_\\ell), A'_\\ell(x_\\ell - d)) \\odot \\left[ \\bigodot_{s=\\ell+1}^{L} m_s(A_s(x^P_s), A'_s(x^P_s - k^\\ell_s(d))) \\right]. \\qquad (6)\n\nNow, we want to pull the prefix m_\u2113(A_\u2113(x_\u2113), A'_\u2113(x_\u2113 \u2212 d)) out of the \u201csum\u201d. For that purpose, we\nwill need the set of endpoints out(x_\u2113), introduced in the notation in Section 2. 
The \u201csum\u201d can be\nre-written in terms of those endpoints as\n\nU_\\ell(x_\\ell, d) = \\bigoplus_{x_{\\ell+1} \\in \\mathrm{out}(x_\\ell)} \\bigoplus_{P \\in \\mathcal{P}_{\\ell+1}(x_{\\ell+1}, \\gamma_{\\ell+1}(d))} m_\\ell(A_\\ell(x_\\ell), A'_\\ell(x_\\ell - d)) \\odot \\left[ \\bigodot_{s=\\ell+1}^{L} m_s(A_s(x^P_s), A'_s(x^P_s - k^\\ell_s(d))) \\right]. \\qquad (7)\n\nAlgorithm 1 Backward pass\nprocedure BACKWARD(G, G')\n    for A_L(x_L) \u2208 A_L do\n        for d \u2208 k_L(D) do\n            U_L(x_L, d) \u2190 m_L(A_L(x_L), A'_L(x_L \u2212 d))    \u25b7 Initialize the last layer.\n        end for\n    end for\n    for \u2113 = L \u2212 1, . . . , 0 do\n        for A_\u2113(x_\u2113) \u2208 A_\u2113 do\n            for d \u2208 k_\u2113(D) do\n                S \u2190 0\n                for x_{\u2113+1} \u2208 out(x_\u2113) do\n                    S \u2190 S \u2295 U_{\u2113+1}(x_{\u2113+1}, \u03b3_{\u2113+1}(d))\n                end for\n                U_\u2113(x_\u2113, d) \u2190 m_\u2113(A_\u2113(x_\u2113), A'_\u2113(x_\u2113 \u2212 d)) \u2299 S\n            end for\n        end for\n    end for\n    return U_0    \u25b7 Return the distribution for the first layer.\nend procedure\n\nThe last step is to use the left-distributivity of \u2299 over \u2295 to pull the prefix out of the \u201csum\u201d:\n\nU_\\ell(x_\\ell, d) = m_\\ell(A_\\ell(x_\\ell), A'_\\ell(x_\\ell - d)) \\odot \\bigoplus_{x_{\\ell+1} \\in \\mathrm{out}(x_\\ell)} \\bigoplus_{P \\in \\mathcal{P}_{\\ell+1}(x_{\\ell+1}, \\gamma_{\\ell+1}(d))} \\bigodot_{s=\\ell+1}^{L} m_s(A_s(x^P_s), A'_s(x^P_s - k^\\ell_s(d)))\n= m_\\ell(A_\\ell(x_\\ell), A'_\\ell(x_\\ell - d)) \\odot \\bigoplus_{x_{\\ell+1} \\in \\mathrm{out}(x_\\ell)} U_{\\ell+1}(x_{\\ell+1}, \\gamma_{\\ell+1}(d)). \\qquad (8)\n\nThe detailed procedure is listed in Algorithm 1. 
We use the notation k_\u2113(D) for the set of subsampled\nshifts which is the result of applying function k_\u2113 to every element of the set of initial shifts D.\n\n5 Choice of neuron matching function m and operators \u2295, \u2299\n\nFor the convolutional layers, we use the matching function\n\nm_{\\mathrm{conv}}(w, v) = \\begin{cases} 0 & \\text{if } w = 0, v = 0, \\\\ \\frac{\\min(w, v)}{\\max(w, v)} & \\text{otherwise.} \\end{cases} \\qquad (9)\n\nFor the max-pooling layers, the computational graph can be truncated to just one active connection\n(as only one element influences higher-level features). Moreover, max-pooling does not create any\nadditional features, only passes/subsamples the existing ones. Thus it does not make sense to take into\naccount the pre-activations for those layers as they are the same as activations (up to subsampling).\nFor these reasons, we use\n\nm_{\\mathrm{maxpool}}(w, v) = \\delta(w = \\arg\\max N_w) \\wedge \\delta(v = \\arg\\max N_v), \\qquad (10)\n\nwhere N_w is the neighborhood of max-pooling covering node w, \u03b4(\u00b7) is the indicator function (1 if\nthe condition holds, 0 otherwise).\nIn this paper, we use sum as \u2295 and product as \u2299. Another possible choice would be max for \u2295\nand min or product for \u2299; theoretically, those combinations satisfy the conditions in Theorem 1.\nNevertheless, we found the sum/product combination to work better than the others. This could be explained\nby the fact that max as \u2295 would be taken over a huge set of paths, which is not robust in practice.\n\n6 Experiments\n\nWe validate our approach in the field of computer vision as our method requires a convolutional\nneural network trained on a large recognition dataset. Out of the vision correspondence tasks, we\nchose stereo matching to validate our method. For this task, the input data dimensionality is B = 2\nand the shift set is represented by horizontal shifts D = {(0, 0, 0), . . . , (D_max, 0, 0)}. We always\nconvert images to grayscale before running CNNs, following the observation by [25] that color does\nnot help.\nAs the pre-trained recognition CNN, we chose the VGG-16 network [20]. This network is summarized\nin Table 1. We will further refer to layer indexes from this table. It is important to mention that we\nhave not used the whole range of layers in our experiments. In particular, we usually started from\nlayer 2 and finished at layer 8. As such, it is still necessary to consider multi-channel input. To\nextend our algorithm to this case, we create a virtual input layer with C = 1 and virtual per-pixel\narcs to all the real input channels. While starting from a later layer is an empirical observation which\nimproves the results for our method, the advantage of finishing at an earlier layer was discovered by\nother researchers as well [5] (starting from some layer, the network activations stop being related to\nindividual pixels). We will thus abbreviate our methods as \u201cours(s, t)\u201d where \u201cs\u201d is the starting layer\nand \u201ct\u201d is the last layer.\n\nTable 1: Summary of the convolutional neural network VGG-16. We only show the part up to the 8th\nlayer as we do not use higher activations (they are not pixel-related enough). In the layer type row, c\nstands for 3x3 convolution with stride 1 followed by the ReLU non-linear activation function [11] and\np for 2x2 max-pooling with stride 2. The input to convolution is padded with the \u201csame as boundary\u201d\nrule.\n\nLayer index      1    2    3    4    5    6    7    8\nLayer type       c    c    p    c    c    p    c    c\nOutput channels  64   64   64   128  128  128  256  256\n\n6.1 Experimental setup\n\nFor the stereo matching, we chose the largest available datasets KITTI 2012 and KITTI 2015. All\nimage pairs in these datasets are rectified, so correspondences can be searched in the same row.\nFor each training pair, the ground-truth shift is measured densely per-pixel. 
This ground truth was\nobtained by projecting the point cloud from LIDAR on the reference image. The quality measure is\nthe percentage Errt of pixels whose predicted shift error is bigger than a threshold of t pixels. We\nconsidered a range of thresholds t = 1, . . . , 5, while the main benchmark measure is Err3. This\nmeasure is only computed for the pixels which are visible in both images from the stereo pair.\nFor comparison with the baselines, we used the setup proposed in [25] \u2014 the seminal work which\nintroduced deep learning for stereo matching and which currently stays one of the best methods on the\nKITTI datasets. [24] is an extensive study which has a representative comparison of learning-based\nand non-learning-based methods under the same setup and open-source code [24] for this setup. The\nwhole pipeline works as follows. First, we obtain the raw scores U (x, d) from Algorithm 1 for the\nshifts up to Dmax = 228. Then we normalize the scores U (x,\u00b7) per-pixel by dividing them over\nthe maximal score, thus turning them into the range [0, 1], suitable for running the post-processing\ncode [24]. Finally, we run the post-processing code with exactly the same parameters as the original\nmethod [25] and measure the quality on the same 40 validation images.\n\n6.2 Baselines\n\nWe have two kinds of baselines in our evaluation: those coming from [25] and our simpler versions\nof deep feature transfer similar to [13], which do not consider paths.\nThe \ufb01rst group of baselines from [25] are the following: the sum of absolute differences \u201csad\u201d,\nthe census transform \u201ccens\u201d [23], the normalized cross-correlation \u201cncc\u201d. 
We also included the\nlearning-based methods \u201cfst\u201d and \u201cacrt\u201d [25] for completeness, although they use training data to\nlearn features while our method does not.\nFor the second group of baselines, we stack up the activation volumes for the given layer range\nand up-sample the layer volumes if they have reduced resolution. Then we compute normalized\ncross-correlation of the stacked features. Those baselines are denoted \u201ccorr(s, t)\u201d where \u201cs\u201d is the\nstarting layer, \u201ct\u201d is the last layer. Note that we correlate the features before applying ReLU following\nwhat [25] does for the last layer. Thus we use the input to the ReLU inside the layers.\n\nTable 2: This table shows the percentages of erroneous pixels Err_t for thresholds t = 1, . . . , 5\non the KITTI 2012 validation set from [25]. Our method is denoted \u201cours(2, 8)\u201d. The two rightmost\ncolumns \u201cfst\u201d and \u201cacrt\u201d correspond to learning-based methods from [25]. We give them for\ncompleteness, as all the other methods, including ours, do not use learning.\n\nThreshold  sad   cens  ncc   corr(1, 2)  corr(2, 2)  corr(2, 8)  ours(2, 8)  fst   acrt\n1          -     -     -     20.6        20.4        20.7        17.4        -     -\n2          -     -     -     10.5        10.4        8.14        6.40        -     -\n3          8.16  4.90  8.93  7.58        7.52        5.23        3.94        3.02  2.61\n4          -     -     -     6.19        6.13        4.02        2.99        -     -\n5          -     -     -     5.40        5.36        3.42        2.49        -     -\n\nTable 3: KITTI 2012 ablation study.\n\nThreshold  ours(2, 2)  ours(2, 3)  central(2, 8)  ours(2, 8)\n1          17.7        18.4        17.3           17.4\n2          7.90        8.16        6.58           6.40\n3          5.28        5.41        4.02           3.94\n4          4.08        4.05        3.04           2.99\n5          3.41        3.32        2.53           2.49\n\nAll the methods, including ours, undergo the same post-processing pipeline. 
This pipeline consists of\nsemi-global matching [9], left-right consistency check, sub-pixel enhancement by \ufb01tting a quadratic\ncurve, median and bilateral \ufb01ltering. We refer the reader to [25] for the full description. While the\n\ufb01rst group of baselines was tuned by [25] and we take the results from that paper, we had to tune the\npost-processing hyper-parameters of the second group of baselines to obtain the best results.\n\n6.3 KITTI 2012\n\nThe dataset consists of 194 training image pairs and 195 test image pairs. The re\ufb02ective surfaces like\nwindshields were excluded from the ground truth.\nThe results in Table 2 show that our method \u201cours(2, 8)\u201d performs better compared to the baselines.\nAt the same time, its performance is lower than learning-based methods from [25]. The main promise\nof our method is scalability: while we test it on a task where huge effort was invested into collecting\nthe training data, there are other important tasks without such extensive datasets.\n\n6.4 Ablation study on KITTI 2012\n\nThe goal of this section is to understand how important is the deep hierarchy of features versus one or\nfew layers. We compared the following setups: \u201cours(2, 2)\u201d uses only the second layer, \u201cours(2, 3)\u201d\nuses only the range from layer 2 to layer 3, \u201ccentral(2, 8)\u201d considers the full range of layers but only\nwith central arcs in the convolutions (connecting same pixel positions between activations) taken into\naccount in the backward pass, \u201cours(2, 8)\u201d is the full method. The result in Table 3 shows that it is\npro\ufb01table to use the full hierarchy both in terms of depth and coverage of the receptive \ufb01eld.\n\n6.5 KITTI 2015\n\nThe stereo dataset consists of 200 training image pairs and 200 test image pairs. 
The main difference\nto KITTI 2012 is that the images are colored and the reflective surfaces are present in the evaluation.\nSimilar conclusions to KITTI 2012 can be drawn from experimental results: our method provides a\nreasonable transfer, being inferior only to learning-based methods (see Table 4). We show our depth\nmap results in Fig. 2.\n\nTable 4: This table shows the percentages of erroneous pixels Err_t for thresholds t = 1, . . . , 5\non the KITTI 2015 validation set from [25]. Our method is denoted \u201cours(2, 8)\u201d. The two rightmost\ncolumns \u201cfst\u201d and \u201cacrt\u201d correspond to learning-based methods from [25]. We give them for\ncompleteness, as all the other methods, including ours, do not use learning.\n\nThreshold  sad   cens  ncc   corr(1, 2)  corr(2, 2)  corr(2, 8)  ours(2, 8)  fst   acrt\n1          -     -     -     26.6        26.5        29.6        26.2        -     -\n2          -     -     -     10.9        10.8        11.2        9.27        -     -\n3          9.44  5.03  8.89  6.68        6.63        6.16        4.78        3.99  3.25\n4          -     -     -     5.05        5.03        4.42        3.36        -     -\n5          -     -     -     4.22        4.20        3.60        2.72        -     -\n\nFigure 2: Results on KITTI 2015. Top to bottom: reference image, searched image, our depth result.\nThe depth is visualized in the standard KITTI color coding (from close to far: yellow, green, purple,\nred, blue).\n\n6.6 Style transfer experiment on KITTI 2015\n\nThe goal of this experiment is to show the robustness of recognition hierarchy for the transfer to\ncorrespondence search, something we advocated in the introduction as the advantage of our\napproach. We apply the style transfer method [5], implemented in the Prisma app. We ran different\nstyle transfers on the left and right images. While now very different at the pixel level, the higher\nlevel descriptions of the images remain the same, which allows us to run our method successfully. The\nqualitative results show the robustness of our path-based method in Fig. 3 (see also Fig. 
2 for visual\ncomparison to normal data).\n\nFigure 3: Results for the style transfer on KITTI 2015. Top to bottom: reference image, searched\nimage, our depth result. The depth is visualized in the standard KITTI color coding (from close to\nfar: yellow, green, purple, red, blue).\n\n8\n\n\f7 Conclusion\n\nIn this work, we have presented a method for transfer from recognition to correspondence search at\nthe lowest level. For that, we re-use activation paths from deep convolutional neural networks and\npropose an ef\ufb01cient polynomial algorithm to aggregate an exponential number of such paths. The\nempirical results on the stereo matching task show that our method is competitive among methods\nwhich do not use labeled data from the target domain. It would be interesting to apply this technique\nto sound, which should become possible once a high-quality deep convolutional model becomes\naccessible to the public (e.g., [21]).\n\nAcknowledgements\n\nWe would like to thank Dmitry Laptev, Alina Kuznetsova and Andrea Cohen for their comments\nabout the manuscript. We also thank Valery Vishnevskiy for running our code while our own cluster\nwas down. This work is partially funded by the Swiss NSF project 163910 \u201cEf\ufb01cient Object-Centric\nDetection\u201d.\n\nReferences\n[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder\n\narchitecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.\n\n[2] Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspon-\n\ndence network. In Advances in Neural Information Processing Systems, pages 2414\u20132422, 2016.\n\n[3] J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang, E Tzeng, and T Darrell. Decaf: A deep convolutional\n\nactivation feature for generic visual recognition. corr abs/1310.1531 (2013), 2013.\n\n[4] David Eigen and Rob Fergus. 
Predicting depth, surface normals and semantic labels with a common\nmulti-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer\nVision, pages 2650\u20132658, 2015.\n\n[5] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint\n\narXiv:1508.06576, 2015.\n\n[6] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision\n\nbenchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.\n\n[7] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision,\n\npages 1440\u20131448, 2015.\n\n[8] Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce. Proposal \ufb02ow. In Proceedings of the IEEE\n\nConference on Computer Vision and Pattern Recognition, pages 3475\u20133484, 2016.\n\n[9] Heiko Hirschmuller. Accurate and ef\ufb01cient stereo processing by semi-global matching and mutual\ninformation. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society\nConference on, volume 2, pages 807\u2013814. IEEE, 2005.\n\n[10] Seungryong Kim, Dongbo Min, Bumsub Ham, Sangryul Jeon, Stephen Lin, and Kwanghoon Sohn. Fcss:\nFully convolutional self-similarity for dense semantic correspondence. arXiv preprint arXiv:1702.00926,\n2017.\n\n[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n\n[12] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic seg-\nmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n3431\u20133440, 2015.\n\n[13] Jonathan L Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence? 
In Advances in Neural Information Processing Systems, pages 1601\u20131609, 2014.\n\n[14] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.\n\n[15] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.\n\n[16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779\u2013788, 2016.\n\n[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91\u201399, 2015.\n\n[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211\u2013252, 2015.\n\n[19] Pierre Sermanet, David Eigen, Xiang Zhang, Micha\u00ebl Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.\n\n[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.\n\n[21] A\u00e4ron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR abs/1609.03499, 2016.\n\n[22] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? 
In Advances in Neural Information Processing Systems, pages 3320\u20133328, 2014.\n\n[23] Ramin Zabih and John Woodfill. Non-parametric local transforms for computing visual correspondence. In European Conference on Computer Vision, pages 151\u2013158. Springer, 1994.\n\n[24] Jure Zbontar and Yann LeCun. MC-CNN GitHub repository. https://github.com/jzbontar/mc-cnn, 2016.\n\n[25] Jure Zbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32):2, 2016.", "award": [], "sourceid": 808, "authors": [{"given_name": "Nikolay", "family_name": "Savinov", "institution": "ETH Zurich"}, {"given_name": "Lubor", "family_name": "Ladicky", "institution": "ETH Zurich"}, {"given_name": "Marc", "family_name": "Pollefeys", "institution": "ETH Zurich"}]}