{"title": "Arbicon-Net: Arbitrary Continuous Geometric Transformation Networks for Image Registration", "book": "Advances in Neural Information Processing Systems", "page_first": 3415, "page_last": 3425, "abstract": "This paper concerns the underdetermined problem of estimating the geometric transformation between image pairs. Recent methods introduce deep neural networks to predict the controlling parameters of hand-crafted geometric transformation models (e.g. thin-plate spline) for image registration and matching. However, these low-dimensional parametric models are incapable of estimating highly complex geometric transforms and have limited flexibility to model the actual geometric deformation between image pairs. To address this issue, we present an end-to-end trainable deep neural network, named Arbitrary Continuous Geometric Transformation Networks (Arbicon-Net), which directly predicts the dense displacement field for pairwise image alignment. Arbicon-Net generalizes from training data to predict the desired arbitrary continuous geometric transformation in a data-driven manner for unseen image pairs. In particular, without imposing penalization terms, the predicted displacement vector function is proven to be spatially continuous and smooth. To verify the performance of Arbicon-Net, we conducted semantic alignment tests over both synthetic and real image datasets under various experimental settings. 
The results demonstrate that Arbicon-Net outperforms previous image alignment techniques in identifying image correspondences.", "full_text": "Arbicon-Net: Arbitrary Continuous Geometric\nTransformation Networks for Image Registration\n\nJianchun Chen \u2217\nNYU Multimedia and Visual Computing Lab\nNew York University\nBrooklyn, NY 11201\njc7009@nyu.edu\n\nXiang Li\nNYU Multimedia and Visual Computing Lab\nNew York University\nBrooklyn, NY 11201\nxl845@nyu.edu\n\nLingjing Wang \u2217\nNYU Multimedia and Visual Computing Lab\nNew York University\nBrooklyn, NY 11201\nlw1474@nyu.edu\n\nYi Fang \u2020\nNYU Multimedia and Visual Computing Lab\nNew York University Abu Dhabi\nAbu Dhabi, UAE\nyfang@nyu.edu\n\nAbstract\n\nThis paper concerns the underdetermined problem of estimating the geometric transformation between image pairs. Recent methods introduce deep neural networks to predict the controlling parameters of hand-crafted geometric transformation models (e.g. thin-plate spline) for image registration and matching. However, these low-dimensional parametric models are incapable of estimating highly complex geometric transforms and have limited flexibility to model the actual geometric deformation between image pairs. To address this issue, we present an end-to-end trainable deep neural network, named Arbitrary Continuous Geometric Transformation Networks (Arbicon-Net), which directly predicts the dense displacement field for pairwise image alignment. Arbicon-Net generalizes from training data to predict the desired arbitrary continuous geometric transformation in a data-driven manner for unseen image pairs. In particular, without imposing penalization terms, the predicted displacement vector function is proven to be spatially continuous and smooth. To verify the performance of Arbicon-Net, we conducted semantic alignment tests over both synthetic and real image datasets under various experimental settings. 
The results demonstrate that Arbicon-Net outperforms previous image alignment techniques in identifying image correspondences.\n\n1 Introduction\n\nImage registration plays a fundamental role in many computer vision applications such as medical image processing [1], camera pose estimation [2], and visual tracking [3]. Fig.1 shows the image registration process, which includes geometric transformation estimation and image warping. To formulate the problem of image registration, traditional methods often approach the task in two steps: 1) they first compute hand-crafted image features such as SIFT and HOG [4, 5] to capture pixel-level descriptions, and 2) they then iteratively search for the optimal geometric transformation to register a pair of images, driven by minimizing an alignment loss function. The alignment loss is usually pre-defined as a certain type of similarity metric (e.g. correlation scores) between two sets of image feature descriptors.\n\n\u2217Equal contribution to this paper\n\u2020Corresponding author\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Illustration of Arbicon-Net for image alignment.\n\nPrevious efforts [6, 7, 8] have achieved great success in image registration through the development of a variety of image feature descriptors and optimization algorithms, as summarized in [9]. However, they often face challenges posed by deteriorated image conditions, such as 1) dramatic image appearance variation (i.e. texture, color, lighting changes, and so on) between image pairs, and 2) significant geometric structural variation between image pairs.\nThe recent success of deep neural networks motivates researchers [10, 11, 12, 13] to develop deep learning techniques that combine both steps into an end-to-end trainable network, which aims to learn a pre-defined geometric model (e.g. 
affine or thin-plate spline) through a regression process supervised by minimizing an image matching loss. By generalizing from training data, these methods perform real-time image matching that is robust to various deteriorated image conditions. However, it is suggested in [14] that pre-defined geometric transformation models represent only a set of low-dimensional transformations, which prevents these methods from predicting the complex geometric transformations needed for high-quality image registration. Moreover, the transformations described by hand-crafted geometric models might not reveal the actual transformation required for image alignment, which leads to a sub-optimal estimate of the desired geometric transformation.\nSome methods [14, 15, 16] tackle this problem by directly estimating semantic flow from pixel-level features. These methods are more flexible in transferring image keypoints to semantically correlated positions. However, since the flow field is estimated entirely from local features without integrating global motion, local points do not move coherently, which consequently generates distorted, unrealistic images. In real-world applications (e.g. [1]), these flow-based methods require explicitly imposed penalization to constrain the smoothness of the flow field.\nTo address the above-mentioned issues, we propose a novel geometric transformation network, named arbitrary continuous geometric transformation networks (Arbicon-Net), to directly predict a dense displacement field that is not formulated by pre-defined hand-crafted geometric models. Compared with geometric-model-based approaches, Arbicon-Net uses a deep neural network to model geometric transformations, accommodating the arbitrarily complex transformations required for the registration of image pairs. 
Compared with semantic-flow-based methods, Arbicon-Net features an attractive property: it predicts a smooth displacement field. As shown in Fig.2, Arbicon-Net simultaneously trains three major modules, namely a front-end geometric feature extractor module, a transformation descriptor encoder module, and a displacement field predictor module, in an end-to-end fashion. Arbicon-Net first extracts dense feature maps from the input image pair and encodes the discriminative local feature correlations into a transformation descriptor. The predictor module then decodes the transformation descriptor into a displacement field for image registration.\nContributions. We make three main contributions in this paper. First, we design a novel Arbicon-Net, which uses deep neural networks to predict a dense displacement field that accommodates arbitrary geometric transformations according to the actual requirements of image registration. This addresses the critical issue that the actual desired geometric transformation does not match the one that a pre-defined geometric model can provide. Second, we prove that Arbicon-Net is guaranteed to generate a spatially continuous and smooth displacement field without imposing an additional penalization term as a smoothness constraint. Finally, we show that our proposed Arbicon-Net achieves superior performance against hand-crafted geometric transformation models under both strong and weak supervision.\n\nFigure 2: Main Pipeline. Our proposed end-to-end trainable Arbicon-Net has three main components: 1) Geometric Feature Extractor Module; 2) Transformation Descriptor Encoder Module; 3) Displacement Field Predictor Module.\n\n2 Related Works\n\nImage Registration. Image registration is defined as the process of determining a smooth geometric transformation between input image pairs, especially for 2D/3D medical images. 
Existing methods search for the optimal geometric transformation by iteratively minimizing an alignment loss, typically defined by feature similarity or hierarchically defined intensity patterns. To achieve high-quality image registration, researchers [17, 1, 9] have explored diverse geometric transformation models, image similarity metrics, and search algorithms.\nNon-learning based Image Correspondence Matching. The classic image correspondence matching pipeline [6, 7, 8] starts by detecting key points via hand-crafted pixel-level feature descriptors [4, 5, 18], followed by feature matching strategies to determine the optimal point correspondences [19, 5]. Subsequent research developed various hand-crafted algorithms [19, 20] to remove incorrect matches by searching for a global transformation or utilizing neighborhood information. While these methods are limited in matching speed and performance, they have had a far-reaching impact on the computer vision community by establishing a standard pipeline and introducing geometric transformation estimation as a mainstream approach to the image matching problem.\nLearning based Image Correspondence Matching. Inspired by the success of deep neural networks, pioneering works [21, 22, 23] propose to use pre-trained convolutional neural networks (CNNs) instead of hand-crafted descriptors to extract discriminative pixel-wise features. Subsequent research developed learnable feature extraction layers [24, 25] and learnable feature matching layers [26, 27] with differentiable image alignment losses. Han et al. [28] introduce a fully learnable image correspondence matching strategy over region proposals; however, this method is not end-to-end trainable.\nMore recently, researchers [10, 11, 12, 13] have proposed end-to-end trainable network architectures for image correspondence estimation. 
Specifically, these methods define a regression network to predict the parameters of a specific geometric transformation model (e.g. thin-plate spline, affine). However, they are limited by the use of low-dimensional geometric models and are consequently less capable of performing fine-grained geometric transformations. Other researchers either recurrently regress a pixel-level flow field [15, 16] to approximate fine-grained image transformations or determine the flow field by neighbourhood consensus assignment [14]. However, as stated above, these approaches do not take the smoothness of the displacement field into account.\n\n3 Approach\n\n3.1 Geometric Feature Extractor\n\nFollowing common image matching paradigms [11], our Arbicon-Net starts by extracting geometric features from the input image pair IA, IB. We first leverage a shared-weight CNN to generate a representative feature map F \u2208 R^{h\u00d7w\u00d7c} for each input image, where at each location the feature vector f_ij \u2208 R^c represents local semantic information.\nTo estimate the geometric transformation of a given image pair, we establish the local feature correlations between the two feature maps using normalized cosine similarity. For each local descriptor from FA, we compute its similarity score with all local descriptors in FB to form a 4-D correlation tensor S \u2208 R^{h\u00d7w\u00d7h\u00d7w}, as shown in Fig.3. 
Each element s_ijkl \u2208 S is computed as\n\ns_ijkl = \u27e8f^A_ij, f^B_kl\u27e9 / (||f^A_ij||_2 ||f^B_kl||_2)\n\n(1)\n\nwhere \u27e8\u00b7\u27e9 denotes the inner product of two vectors, and the denominator acts as a normalization term to further amplify confident matches and reduce ambiguous matches.\n\nFigure 3: Feature Correlation.\n\n3.2 Transformation Descriptor Encoder\n\nTo learn a more discriminative feature correlation, we leverage 4-D convolutional neural networks (CNNs) to refine the correlation tensor S using neighborhood information [14]. The 4-D convolution layers integrate additional neighborhood information compared with regular 2-D CNNs. Since the order of the input image pair, (IA, IB) or (IB, IA), does not influence the local feature correlations, the convolution operation is applied symmetrically, formulated as\n\nS^C = Conv(S) + (Conv(S^T))^T\n\n(2)\n\nwhere the transpose of S is computed according to s^T_ijkl = s_klij. Moreover, we normalize the learned 4-D correlation tensor S^C by Eq.3, where \u03c6^A_pq = {s^c_pq11, ..., s^c_pqhw} and \u03c6^B_pq = {s^c_11pq, ..., s^c_hwpq}. This normalization encourages the bilateral confidence of correlated pairs from the source and target images.\n\n\u02c6s_ijkl = (s^c_ijkl / max(\u03c6^A_ij)) \u00b7 (s^c_ijkl / max(\u03c6^B_kl)) \u00b7 s^c_ijkl\n\n(3)\n\nSince our goal is to find a global transformation, we use a Multi-Layer Perceptron to encode the learned 4-D tensor \u02c6S into a transformation descriptor dAB \u2208 R^m that represents the overall image correspondence information, as shown in Eq.4. 
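The correlation and normalization steps of Eqs. 1 and 3 can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the feature-map shapes are toy sizes, the 4-D convolutional refinement of Eq. 2 is omitted, and the bilateral normalization follows the reconstruction of Eq. 3 above.

```python
import numpy as np

def correlation_tensor(FA, FB):
    """Eq. 1: cosine similarity between every pair of local descriptors.

    FA, FB: (h, w, c) dense feature maps; returns S of shape (h, w, h, w).
    """
    unit = lambda F: F / (np.linalg.norm(F, axis=-1, keepdims=True) + 1e-8)
    return np.einsum('ijc,klc->ijkl', unit(FA), unit(FB))

def bilateral_normalize(S):
    """Eq. 3 (as reconstructed): rescale each score by the maxima of
    phi^A_ij (all scores of source location (i, j)) and phi^B_kl
    (all scores of target location (k, l))."""
    max_a = S.max(axis=(2, 3), keepdims=True)  # max over phi^A_ij
    max_b = S.max(axis=(0, 1), keepdims=True)  # max over phi^B_kl
    return (S / (max_a + 1e-8)) * (S / (max_b + 1e-8)) * S
```

A mutually best-matching pair keeps a score close to its raw value, while one-sided matches are suppressed, which is the "bilateral confidence" behavior described in the text.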
For global geometric transformation learning, the image correspondence information describes a geometric transformation that optimally aligns corresponding points on the two images.\n\nd_AB = MLP(\u02c6S)\n\n(4)\n\n3.3 Displacement Field Predictor\n\nIn general, the geometric transformation T for each point x in a point set X \u2282 R^2 can be defined as\n\nT(x, v) = x + v(x)\n\n(5)\n\nwhere v : R^2 \u2192 R^2 is a \u201cpoint displacement\u201d function. The image registration task can be formulated as the process of determining the displacement function v. According to the Motion Coherence Theory (MCT) [29], the function v must be continuous and smooth. Fortunately, by leveraging a deep neural network architecture, we can construct a suitable displacement function v that satisfies these continuity and smoothness characteristics.\nAs illustrated in Fig.4, given n 2-D points in the source image plane, we duplicate the transformation descriptor d_AB n times.\n\nFigure 4: Displacement Field Predictor Module.\n\nEach point is concatenated with the m-D global descriptor d_AB. We then construct a Displacement Field Predictor network with four successive MLPs to decode the concatenated (m + 2)-D vector into a 2-D displacement vector. We define this neural network structure in Eq.6 as F(\u00b7) : R^{m+2} \u2192 R^2, formulated as\n\nv(x) = F([x, d_AB])\n\n(6)\n\nwhere [\u00b7] indicates the concatenation operation.\nFurthermore, we briefly prove the continuity and smoothness of our displacement field predictor F, so that it can serve as our deep learning-based solution for the displacement function v.\nContinuity. Since both the MLP and the activation function \u03c3 are continuous, the continuity of the Displacement Field Predictor network follows trivially as a composite of continuous functions. 
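As a concrete illustration of Eq. 6, the predictor can be sketched as a small MLP applied to each point concatenated with the tiled descriptor. The layer sizes, the single hidden layer, and the random weights below are illustrative assumptions, not the paper's trained four-MLP architecture.

```python
import numpy as np

def softplus(z):
    """Smooth SoftPlus activation, log(1 + e^z), computed stably."""
    return np.logaddexp(0.0, z)

def displacement_field(points, d_ab, layers):
    """Eq. 6: v(x) = F([x, d_AB]) evaluated for n points at once.

    points: (n, 2) coordinates; d_ab: (m,) transformation descriptor;
    layers: list of (W, b) MLP parameters mapping (2 + m) -> 2.
    """
    n = points.shape[0]
    h = np.concatenate([points, np.tile(d_ab, (n, 1))], axis=1)  # (n, 2 + m)
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:  # smooth activation on hidden layers only
            h = softplus(h)
    return h  # (n, 2) displacement vectors

# toy sizes (assumptions): m = 8 descriptor dims, one hidden layer of 16 units
rng = np.random.default_rng(0)
m = 8
layers = [(0.1 * rng.standard_normal((2 + m, 16)), np.zeros(16)),
          (0.1 * rng.standard_normal((16, 2)), np.zeros(2))]
pts = rng.standard_normal((5, 2))
v = displacement_field(pts, rng.standard_normal(m), layers)  # (5, 2)
```

Because every point shares the same d_AB and passes through the same smooth composite function, nearby input points receive nearby displacement vectors, which is exactly the continuity argument made in the text.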
Since d_AB is concatenated to each point in X, this concatenation operation does not change the continuity of the displacement function v(\u00b7) = F([\u00b7, d_AB]). In contrast, commonly used learning paradigms [1, 15, 16], which directly map a high-dimensional feature space to a 2-D displacement field, output a set of discrete displacement vectors, and the displacements of all other points must then be interpolated.\nSmoothness. Having chosen the smooth function SoftPlus [30] as the activation function in our Displacement Field Predictor network, it becomes trivial to estimate its complexity and smoothness, since the displacement function is a composite of smooth functions (MLPs and SoftPlus). In practice, Regularization Theory (RT) [31] uses the oscillatory behavior of a function to further measure the smoothness of the displacement function. The oscillatory behavior is measured in a Reproducing Kernel Hilbert Space (RKHS) [31, 32] as in Eq.7.\n\n||v||^2_{H_m} = \u222b_{R^D} (|\u02dcv(s)|^2 / \u02dcg(s)) ds\n\n(7)\n\nwhere \u02dcv is the Fourier transform of the displacement function v and \u02dcg is a low-pass filter. In other words, a smoother displacement function has considerably less energy in the high-frequency domain. We generally express models that regress pixel-level displacement vectors, including Arbicon-Net and RTNs [16], as a composite function v in Eq.8.\n\nv(x) = F(G(x)) : R^2 \u2192 R^2\n\n(8)\n\nFunctions G and F denote the point feature encoding network and the point displacement vector regression network, respectively. Specifically, in Arbicon-Net we have G(x) = [x, d_AB]. In contrast, in RTNs G generates a high-dimensional feature map via sequential CNNs. Therefore, the G in RTNs is generally considered a sparse and oscillatory function, especially when the dimension of the feature vector (the output of G) is high, which causes the widely known \u201ccurse of dimensionality\u201d problem. 
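The frequency argument of Eqs. 7 and 8 can be illustrated numerically: feeding the same smooth regression function a linear encoder versus a hypothetical oscillatory encoder shifts spectral energy into high frequencies. Here tanh merely stands in for a smooth F, and the sampling grid, cutoff bin, and signal choices are illustrative assumptions.

```python
import numpy as np

def high_freq_energy_fraction(signal, cutoff=20):
    """Fraction of spectral energy at or above the cutoff frequency bin."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.sum(spectrum[cutoff:] ** 2) / np.sum(spectrum ** 2)

x = np.linspace(0.0, 1.0, 512, endpoint=False)
F = np.tanh  # stand-in for a smooth regression network F

v_linear = F(x)                            # G linear, as in G(x) = [x, d_AB]
v_oscillatory = F(np.sin(40 * np.pi * x))  # G with large max |G'(x)|

# v_oscillatory concentrates far more of its energy above the cutoff than
# v_linear, matching the claim that a smaller u_v yields a smoother field.
```

This mirrors the argument around Eq. 9: with the same F, the encoder with the larger maximum derivative produces a composite with higher essential frequency.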
In this section, we assume that the two models have the same regression network F and that the inputs to F are normalized to the same scale.\nAccording to [33], the Fourier transform of the composite function v has an essential maximum frequency u_v given by\n\nu_v = u_F max_x |G\u2032(x)|\n\n(9)\n\nwhere u_F is the maximum frequency of F, which is independent of G. Assuming the outputs of the different functions G are on the same scale, an oscillatory G tends to have a larger maximum value of |G\u2032(x)| than the linear function used in Arbicon-Net. As a result, our composite function has a smaller u_v and is thus likely to have lower energy in the high-frequency domain, which further guarantees a smoother displacement function.\nBased on our proposed paradigm, we further constrain the smoothness of the function F. Fortunately, given the popularity of deep learning models, the research community has proposed regularization strategies that naturally help our Displacement Field Predictor network reduce the risk of an oscillatory displacement function. One simple solution is to choose a proper network size. In Section 4.3, we provide empirical results validating the smoothness of our estimated displacement function compared with non-rigid geometric transformation models.\n\n3.4 Loss functions\n\nAs shown in the right box of Fig.2, our method is designed to learn the geometric transformation under either strong or weak supervision.\nFor the strongly-supervised loss, we have point correspondence information x \u2208 X from the source plane and y \u2208 Y from the target plane. 
L_strong directly minimizes the pairwise L2 distance between corresponding points in the transformed image plane and the target image plane, as shown in Eq.10.\n\nL_strong = (1/N) \u03a3_{i=1}^{N} ||T(x_i) \u2212 y_i||^2_2\n\n(10)\n\nFor the weakly-supervised loss, we maximize the inner product of corresponding locations in the transformed source feature map T(F_A) and the target feature map F_B, following the paradigm described in [10]. For T(I_A) and I_B to be matched, T(I_A)_ij and I^ij_B must be semantically matched, and thus \u27e8T(f_A)_ij \u00b7 f^ij_B\u27e9 should be maximal. We implement the loss function described in Eq.11.\n\nL_weak = \u2212 \u03a3_{i,j,k,l} s_ijkl 1_d(T(i,j),(k,l))
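A minimal sketch of the strongly-supervised objective of Eq. 10. The point arrays and the pre-warped inputs are illustrative assumptions; in the paper the warped points come from the learned transformation T.

```python
import numpy as np

def strong_loss(warped, target):
    """Eq. 10: mean squared L2 distance between warped source points T(x_i)
    and their target correspondences y_i, both of shape (N, 2)."""
    return float(np.mean(np.sum((warped - target) ** 2, axis=1)))

# perfectly aligned points incur zero loss
pts = np.array([[0.0, 0.0], [1.0, 2.0]])
loss_zero = strong_loss(pts, pts)          # 0.0
loss_shift = strong_loss(pts + 1.0, pts)   # each point off by (1, 1): loss = 2.0
```

Because the loss is a smooth function of the warped coordinates, its gradient flows back through the Displacement Field Predictor, which is what makes the whole pipeline end-to-end trainable.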