{"title": "Robust Near-Isometric Matching via Structured Learning of Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1057, "page_last": 1064, "abstract": "Models for near-rigid shape matching are typically based on distance-related features, in order to infer matches that are consistent with the isometric assumption. However, real shapes from image datasets, even when expected to be related by \"almost isometric\" transformations, are actually subject not only to noise but also, to some limited degree, to variations in appearance and scale. In this paper, we introduce a graphical model that parameterises appearance, distance, and angle features and we learn all of the involved parameters via structured prediction. The outcome is a model for near-rigid shape matching which is robust in the sense that it is able to capture the possibly limited but still important scale and appearance variations. Our experimental results reveal substantial improvements upon recent successful models, while maintaining similar running times.", "full_text": "Robust Near-Isometric Matching via Structured Learning of Graphical Models\n\nJulian J. McAuley\nNICTA/ANU\njulian.mcauley@nicta.com.au\n\nTib\u00e9rio S. Caetano\nNICTA/ANU\ntiberio.caetano@nicta.com.au\n\nAlexander J. Smola\nYahoo! Research\u2217\nalex@smola.org\n\nAbstract\n\nModels for near-rigid shape matching are typically based on distance-related features,\nin order to infer matches that are consistent with the isometric assumption.\nHowever, real shapes from image datasets, even when expected to be related by\n\u201calmost isometric\u201d transformations, are actually subject not only to noise but also,\nto some limited degree, to variations in appearance and scale. In this paper, we\nintroduce a graphical model that parameterises appearance, distance, and angle\nfeatures and we learn all of the involved parameters via structured prediction. 
The\noutcome is a model for near-rigid shape matching which is robust in the sense that\nit is able to capture the possibly limited but still important scale and appearance\nvariations. Our experimental results reveal substantial improvements upon recent\nsuccessful models, while maintaining similar running times.\n\n1 Introduction\n\nMatching shapes in images has many applications, including image retrieval, alignment, and\nregistration [1, 2, 3, 4]. Typically, matching is approached by selecting features for a set of landmark\npoints in both images; a correspondence between the two is then chosen such that some distance\nmeasure between these features is minimised. A great deal of attention has been devoted to defining\ncomplex features which are robust to changes in rotation, scale etc. [5, 6].\u00b9\nAn important class of matching problems is that of near-isometric shape matching. In this setting,\nit is assumed that shapes are defined up to an isometric transformation (allowing for some noise),\nand therefore distance features are typically used to encode the shape. Recent work has shown how\nthe isometric constraint can be exploited by a particular type of graphical model whose topology\nencodes the necessary properties for obtaining optimal matches in polynomial time [11].\nAnother line of work has focused on structured learning to optimize graph matching scores; however,\nno explicit exploitation of the geometrical constraints involved in shape modeling is made [12].\nIn this paper, we combine the best of these two approaches into a single model. We produce an\nexact, efficient model to solve near-isometric shape matching problems using not only\nisometry-invariant features, but also appearance and scale-invariant features. By doing so we can learn the\nrelative importances of variations in appearance and scale with regard to variations in shape per\nse. 
Therefore, even knowing that we are in a near-isometric setting, we will capture the eventual\nvariations in appearance and scale into our matching criterion in order to produce a robust\nnear-isometric matcher. In terms of learning, we introduce a two-stage structured learning approach to\naddress the speed and memory efficiency of this model.\n\n\u2217 Alexander J. Smola was with NICTA at the time of this work.\n\u00b9 We restrict our attention to this type of approach, i.e. that of matching landmarks between images. Some notable approaches deviate from this norm \u2013 see (for example) [7, 8, 9, 10].\n\nFigure 1: The graphical model introduced in [11].\n\n2 Background\n\n2.1 Shape Matching\n\n\u2018Shape matching\u2019 can mean many different things, depending on the precise type of query one is\ninterested in. Here we study the case of identifying an instance of a template shape (S \u2286 T) in a\ntarget scene (U) [1].\u00b2 We assume that we know S, i.e. the points in the template that we want to\nquery in the scene. Typically both T and U correspond to a set of \u2018landmark\u2019 points, taken from a\npair of images (common approaches include [6, 13, 14]).\nFor each point t \u2208 T and u \u2208 U, a certain set of unary features are extracted (here denoted by \u03c6(t),\n\u03c6(u)), which contain local information about the image at that point [5, 6]. If y : S \u2192 U is a generic\nmapping representing a potential match, the goal is then to find a mapping $\\hat{y}$ which minimises the\naggregate distance between corresponding features, i.e.\n\n$$\\hat{y} = f(S, U) = \\operatorname*{argmin}_{y} \\sum_{i=1}^{|S|} c_1(s_i, y(s_i)), \\quad \\text{where } c_1(s_i, y(s_i)) = \\|\\phi(s_i) - \\phi(y(s_i))\\|_2^2 \\tag{1}$$\n\n(here $\\|\\cdot\\|_2$ denotes the $L_2$ norm). For injective y, eq. (1) is a linear assignment problem, efficiently\nsolvable in cubic time. In addition to unary or first-order features, pairwise or second-order features\ncan be induced from the locations of the unary features. In this case eq. (1) would be generalised\nto minimise an aggregate distance between pairwise features. This however induces an NP-hard\nproblem (quadratic assignment). Discriminative structured learning has recently been applied to\nmodels of both linear and quadratic assignment in [12].\n\n2.2 Graphical Models\n\nIn isometric matching settings, one may suspect that it may not be necessary to include all pairwise\nrelations in quadratic assignment. In fact a recent paper [11] has shown that if only the distances as\nencoded by the graphical model depicted in figure 1 are taken into account (nodes represent points\nin S and states represent points in U), exact probabilistic inference in such a model can solve the\nisometric problem optimally. That is, an energy function of the following form is minimised:\u00b3\n\n$$\\sum_{i=1}^{|S|} c_2(s_i, s_{i+1}, y(s_i), y(s_{i+1})) + c_2(s_i, s_{i+2}, y(s_i), y(s_{i+2})). \\tag{2}$$\n\nIn [11], it is shown that loopy belief propagation using this model converges to the optimal\nassignment, and that the number of iterations required before convergence is small in practice.\nWe will extend this model by adding a unary term, $c_1(s_i, y(s_i))$ (as in (eq. 1)), and a third-order\nterm, $c_3(s_i, s_{i+1}, s_{i+2}, y(s_i), y(s_{i+1}), y(s_{i+2}))$. Note that the graph topology remains the same.\n\n\u00b2 Here T is the set of all points in the template scene, whereas S corresponds to those points in which we are interested. It is also important to note that we treat S as an ordered object in our setting.\n\n\u00b3 $s_{i+1}$ should be interpreted as $s_{(i+1) \\bmod |S|}$ (i.e. 
the points form a loop).\n\n2.3 Discriminative Structured Learning\n\nIn practice, feature vectors may be very high-dimensional, and which components are \u2018important\u2019\nwill depend on the specific properties of the shapes being matched. Therefore, we introduce a\nparameter, \u03b8, which controls the relative importances of the various feature components. Note that\n\u03b8 is parameterising the matching criterion itself. Hence our minimisation problem becomes\n\n$$\\hat{y} = f(S, U; \\theta) = \\operatorname*{argmax}_{y} \\langle h(S, U, y), \\theta \\rangle \\tag{3}$$\n\n$$\\text{where } h(S, U, y) = -\\sum_{i=1}^{|S|} \\Phi(s_i, s_{i+1}, s_{i+2}, y(s_i), y(s_{i+1}), y(s_{i+2})) \\tag{4}$$\n\n(y is a mapping from S to U, \u03a6 is a third-order feature vector \u2013 our specific choice is shown in\nsection 3).\u2074 In order to measure the performance of a particular weight vector, we use a loss\nfunction, $\\Delta(\\hat{y}, y^i)$, which represents the cost incurred by choosing the assignment $\\hat{y}$ when the correct\nassignment is $y^i$ (our specific choice of loss function is described in section 4). To avoid overfitting,\nwe also desire that \u03b8 is sufficiently \u2018smooth\u2019. Typically, one uses the squared $L_2$ norm, $\\|\\theta\\|_2^2$, to\npenalise non-smooth choices of \u03b8 [15].\nLearning in this setting now becomes a matter of choosing \u03b8 such that the empirical risk (average\nloss on all training instances) is minimised, but which is also sufficiently \u2018smooth\u2019 (to prevent\noverfitting). Specifically, if we have a set of training pairs, $\\{S^1 \\ldots S^N\\}$, $\\{U^1 \\ldots U^N\\}$, with labelled\nmatches $\\{y^1 \\ldots y^N\\}$, then we wish to minimise\n\n$$\\underbrace{\\frac{1}{N} \\sum_{i=1}^{N} \\Delta(f(S^i, U^i; \\theta), y^i)}_{\\text{empirical risk}} + \\underbrace{\\frac{\\lambda}{2} \\|\\theta\\|_2^2}_{\\text{regulariser}}. \\tag{5}$$\n\nHere \u03bb (the regularisation constant) controls the relative importance of minimising the empirical risk\nagainst the regulariser. In our case, we simply choose \u03bb such that the empirical risk on our validation\nset is minimised.\nSolving (eq. 5) exactly is an extremely difficult problem and in practice is not feasible, since the\nloss is piecewise constant on the parameter \u03b8. Here we capitalise on recent advances in large-margin\nstructured estimation [15], which consist of obtaining convex relaxations of this problem. Without\ngoing into the details of the solution (see, for example, [15, 16]), it can be shown that a convex\nrelaxation of this problem can be obtained, which is given by\n\n$$\\min_{\\theta} \\; \\frac{1}{N} \\sum_{i=1}^{N} \\xi_i + \\frac{\\lambda}{2} \\|\\theta\\|_2^2 \\tag{6a}$$\n\nsubject to\n\n$$\\langle h(S^i, U^i, y^i) - h(S^i, U^i, y), \\theta \\rangle \\geq \\Delta(y, y^i) - \\xi_i \\quad \\text{for all } i \\text{ and } y \\in Y \\tag{6b}$$\n\n(where Y is the space of all possible mappings). It can be shown that for the solution of the above\nproblem, we have that $\\xi_i^* \\geq \\Delta(f(S^i, U^i; \\theta), y^i)$. This means that we end up minimising an upper\nbound on the loss, instead of the loss itself.\nSolving (6) requires only that we are able, for any value of \u03b8, to find\n\n$$\\operatorname*{argmax}_{y} \\bigl( \\langle h(S^i, U^i, y), \\theta \\rangle + \\Delta(y, y^i) \\bigr). \\tag{7}$$\n\nIn other words, for each value of \u03b8, we are able to identify the mapping which is consistent with the\nmodel (eq. 3), yet incurs a high loss. This process is known as \u2018column generation\u2019 [15, 16]. 
As we\nwill define our loss as a sum over the nodes, solving (eq. 7) is no more difficult than solving (eq. 3).\n\n\u2074 We have expressed (eq. 3) as a maximisation problem as a matter of convention; this is achieved simply by negating the cost function in (eq. 4).\n\nFigure 2: Left: the (ordered) set of points in our template shape (S). Centre: connections between\nimmediate neighbours. Right: connections between neighbour\u2019s neighbours (our graphical model).\n\n3 Our Model\n\nAlthough the model of [11] solves isometric matching problems optimally, it provides no guarantees\nfor near-isometric problems, as it only considers those compatibilities which form cliques in our\ngraphical model. However, we are often only interested in the boundary of the object: if we look at\nthe instance of the model depicted in figure 2, it seems to capture exactly the important dependencies;\nadding additional dependencies between distant points (such as the duck\u2019s tail and head) would be\nunlikely to contribute to this model.\nWith this in mind, we introduce three new features (for brevity we use the shorthand $y_i = y(s_i)$):\n\n$\\Phi_1(s_1, s_2, y_1, y_2) = (d_1(s_1, s_2) - d_1(y_1, y_2))^2$, where $d_1(a, b)$ is the Euclidean distance between a and b, scaled according to the width of the target scene.\n\n$\\Phi_2(s_1, s_2, s_3, y_1, y_2, y_3) = (d_2(s_1, s_2, s_3) - d_2(y_1, y_2, y_3))^2$, where $d_2(a, b, c)$ is the Euclidean distance between a and b scaled by the average of the distances between a, b, and c.\n\n$\\Phi_3(s_1, s_2, s_3, y_1, y_2, y_3) = (\\angle(s_1, s_2, s_3) - \\angle(y_1, y_2, y_3))^2$, where $\\angle(a, b, c)$ is the angle between a and c, w.r.t. b.\u2075\n\nWe also include the unary features $\\Phi_0(s_1, y_1) = (\\phi(s_1) - \\phi(y_1))^2$ (i.e. the pointwise squared\ndifference between $\\phi(s_1)$ and $\\phi(y_1)$). 
$\\Phi_1$ is exactly the feature used in [11], and is invariant to isometric\ntransformations (rotation, reflection, and translation); $\\Phi_2$ and $\\Phi_3$ capture triangle similarity, and are\nthus also invariant to scale. In the context of (eq. 4), we have\n\n$$\\Phi(s_1, s_2, s_3, y_1, y_2, y_3) := \\bigl[ \\Phi_0(s_1, y_1),\\; \\Phi_1(s_1, s_2, y_1, y_2) + \\Phi_1(s_1, s_3, y_1, y_3),\\; \\Phi_2(s_1, s_2, s_3, y_1, y_2, y_3) + \\Phi_2(s_1, s_3, s_2, y_1, y_3, y_2),\\; \\Phi_3(s_1, s_2, s_3, y_1, y_2, y_3) \\bigr]. \\tag{8}$$\n\nIn practice, landmark detectors often identify several hundred points [6, 17], which is clearly\nimpractical for an $O(|S||U|^3)$ method (|U| is the number of landmarks in the target scene). To address\nthis, we adopt a two-stage learning approach: in the first stage, we learn only unary compatibilities,\nexactly as is done in [12]. During the second stage of learning, we collapse the first-order feature\nvector into a single term, namely\n\n$$\\Phi'_0(s_1, y_1) = \\langle \\theta_0, \\Phi_0(s_1, y_1) \\rangle \\tag{9}$$\n\n($\\theta_0$ is the weight vector learned during the first stage). We now perform learning for the third-order\nmodel, but consider only the p \u2018most likely\u2019 matches for each node, where the likelihood is simply\ndetermined using $\\Phi'_0(s_1, y_1)$. This reduces the running time and memory requirements to $O(|S|p^3)$.\nA consequence of using this approach is that we must now tune two regularisation constants; this is\nnot an issue in practice, as learning can be performed quickly using this approach.\u2076\n\n\u2075 Using features of such different scales can be an issue for regularisation \u2013 in practice we adjusted these\nfeatures to have roughly the same scale. For full details, our implementation is available at (not included for\nblind review).\n\n\u2076 In fact, even in those cases where a single-stage approach was tractable (such as the experiment in section\n4.1), we found that the two-stage approach worked better. 
Typically, we required much less regularity during\nthe second stage, possibly because the higher order features are heterogeneous.\n\n4\n\n\fFigure 3: Left: The adjacency structure of the graph (top); the boundary of our \u2018shape\u2019 (centre);\nthe topology of our graphical model (bottom). Right: Example matches using linear assignment\n(top, 6/30 mismatches), quadratic assignment (centre, 4/30 mismatches), and the proposed model\n(bottom, no mismatches). The images shown are the 12th and 102nd frames in our sequence. Correct\nmatches are shown in green, incorrect matches in red. All matches are reported after learning.\n\n4 Experiments\n\n4.1 House Data\n\nIn our \ufb01rst experiment, we compare our method to those of [11] and [12]. Both papers report the\nperformance of their methods on the CMU \u2018house\u2019 sequence \u2013 a sequence of 111 frames of a toy\nhouse, with 30 landmarks identi\ufb01ed in each frame.7 As in [12], we compute the Shape Context\nfeatures for each of the 30 points [5].\nIn addition to the unary model of [12], a model based on quadratic assignment is also presented, in\nwhich pairwise features are determined using the adjacency structure of the graphs. Speci\ufb01cally, if a\npair of points (p1, p2) in the template scene is to be matched to (q1, q2) in the target, there is a feature\nwhich is 1 if there is an edge between p1 and p2 in the template, and an edge between q1 and q2 in\nthe target (and 0 otherwise). We also use such a feature for this experiment, however our model only\nconsiders matchings for which (p1, p2) forms an edge in our graphical model (see \ufb01gure 3, bottom\nleft). The adjacency structure of the graphs is determined using the Delaunay triangulation, (\ufb01gure\n3, top left).\nAs in [11], we compare pairs of images with a \ufb01xed baseline (separation between frames). For our\nloss function, \u2206(\u02c6y, yi), we used the normalised Hamming loss, i.e. 
the proportion of mismatches.\nFigure 4 shows our performance on this dataset, as the baseline increases. On the left we show the\nperformance without learning, for which our model exhibits the best performance by a substantial\nmargin.8\nOur method is also the best performing after learning \u2013 in fact, we achieve almost zero error for all\nbut the largest baselines (at which point our model assumptions become increasingly violated, and\nwe have less training data). In \ufb01gure 5, we see that the running time of our method is similar to the\nquadratic assignment method of [12]. To improve the running time, we also show our results with\np = 10, i.e. for each point in the template scene, we only consider the 10 \u2018most likely\u2019 matches, using\nthe weights from the \ufb01rst stage of learning. This reduces the running time by more than an order of\n\n7http://vasc.ri.cmu.edu/idb/html/motion/house/index.html\n8Interestingly, the quadratic method of [12] performs worse than their unary method; this is likely because\nthe relative scale of the unary and quadratic features is badly tuned before learning, and is indeed similar to\nwhat the authors report. Furthermore, the results we present for the method of [12] after learning are much\nbetter than what the authors report \u2013 in that paper, the unary features are scaled using a pointwise exponent\n(\u2212 exp(\u2212|\u03c6a \u2212 \u03c6b|2)), whereas we found that scaling the features linearly (|\u03c6a \u2212 \u03c6b|2) worked better.\n\n5\n\n\fFigure 4: Comparison of our technique against that of [11] (\u2018point matching\u2019), and [12] (\u2018linear\u2019,\n\u2018quadratic\u2019). The performance before learning is shown on the left, the performance after learning is\nshown on the right. Our method exhibits the best performance both before and after learning (note\nthe different scales of the two plots). 
Error bars indicate standard error.\n\nFigure 5: The running time and performance of our method, compared to those of [12] (note that the\nmethod of [11] has running time identical to our method). Our method is run from 1 to 20 iterations\nof belief propagation, although the method appears to converge in fewer than 5 iterations.\n\nmagnitude, bringing it closer to that of linear assignment; even this model achieves approximately\nzero error up to a baseline of 50.\nFinally, figure 6 (left) shows the weight vector of our model, for a baseline of 60. The first 60\nweights are for the Shape Context features (determined during the first stage of learning), and the\nfinal 5 show the weights from our second stage of learning (the weights correspond to the first-order\nfeatures, distances, adjacencies, scaled distances, and angles, respectively \u2013 see section 3). We can\nprovide some explanation of the learned weights: the Shape Context features are separated into 5\nradial and 12 angular bins \u2013 the fact that there are peaks around the 16th and 24th features indicates\nthat some particular radial bins are more important than the others; the fact that several consecutive\nbins have low weight indicates that some radial bins are unimportant (etc.). It is much more difficult\nto reason about the second stage of learning, as the features have different scales, and cannot be\ncompared directly \u2013 however, it appears that all of the higher-order features are important to our\nmodel.\n\n4.2 Bikes Data\n\nFor our second experiment, we used images of bicycles from the Caltech 256 Dataset [18]. Bicycles\nare reasonably rigid objects, meaning that matching based on their shape is logical. Although the\nimages in this dataset are fairly well aligned, they are subject to reflections as well as some scaling\nand shear. 
For each image in the dataset, we detected landmarks automatically, and six points on\nthe frame were hand-labelled (see figure 7). Only shapes in which these interest points were not\noccluded were used, and we only included images that had a background; in total, we labelled 44\nimages. The first image was used as the \u2018template\u2019, the other 43 were used as targets. Thus we are\nlearning to match bicycles similar to the chosen template.\n\n[Plot: House data, no learning. x-axis: Baseline; y-axis: Normalised Hamming loss on test set; series: point matching, linear, quadratic, higher order.]\n[Plot: House data, learning. x-axis: Baseline; y-axis: Normalised Hamming loss on test set; series: linear (learning), quadratic (learning), higher order (learning, 10 points), higher order (learning).]\n[Plot: House (baseline = 60). x-axis: Average running time (seconds, logarithmic scale); y-axis: Normalised Hamming loss on test set; series: linear (learning), quadratic (learning), higher order (learning, 10 points), higher order (learning).]\n\nFigure 6: Left: The weight vector of our method after learning, for the \u2018house\u2019 data. The first 60\nweights are for the Shape Context features from the first stage of learning; the final 5 weights are\nfor the second stage of learning. Right: The same plot, for the \u2018bikes\u2019 data.\n\nFigure 7: Top: A selection of our training images. Bottom: An example match from our test set.\nLeft: The template image (with the shape outlined in green, and landmark points marked in blue).\nCentre: The target image, and the match (in red) using unary features with the affine invariant/SIFT\nmodel of [17] after learning (endpoint error = 0.27). Right: the match using our model after learning\n(endpoint error = 0.04).\n\nInitially, we used the SIFT landmarks and features as described in [6]. 
Since this approach typically\nidenti\ufb01es several hundred landmarks, we set p = 20 for this experiment (i.e. we consider the 20\nmost likely points). Since we cannot hope to get exact matches, we use the endpoint error instead\nof the normalised Hamming loss, i.e. we reward points which are close to the correct match.9 Table\n1 reveals that the performance of this method is quite poor, even with the higher-order model, and\nfurthermore reveals no bene\ufb01t from learning. This may be explained by the fact that although the\nSIFT features are invariant to scale and rotation, they are not invariant to re\ufb02ection.\nIn [17], the authors report that the SIFT features can provide good matches in such cases, as long as\nlandmarks are chosen which are locally invariant to af\ufb01ne transformations. They give a method for\nidentifying af\ufb01ne-invariant feature points, whose SIFT features are then computed.10 We achieve\nmuch better performance using this method, and also observe a signi\ufb01cant improvement after learn-\ning. Figure 7 shows an example match using both the unary and higher-order techniques.\nFinally, \ufb01gure 6 (right) shows the weights learned for this model. Interestingly, the \ufb01rst-order term\nduring the second stage of learning has almost zero weight. 
This must not be misinterpreted: during\nthe second stage, the response of each of the 20 candidate points is so similar that the first-order\nfeatures are simply unable to convey any new information \u2013 yet they are still very useful in determining\nthe 20 candidate points.\n\n\u2079 Here the endpoint error is just the average Euclidean distance from the correct label, scaled according to the width of the image.\n\n\u00b9\u2070 We used publicly available implementations of both methods.\n\n[Plot: House data first/higher order weight vector (baseline = 60). x-axis: Index; y-axis: Importance.]\n[Plot: Bikes data first/higher order weight vector. x-axis: Index; y-axis: Importance.]\n\nTable 1: Performance on the \u2018bikes\u2019 dataset. The endpoint error is reported, with standard errors in\nparentheses (note that the second-last column, \u2018higher-order\u2019, uses the weights from the first stage of\nlearning, but not the second).\n\nDetector/descriptor | Split | unary | + learning | higher-order | + learning\nSIFT [6] | Training | 0.335 (0.038) | 0.319 (0.034) | 0.234 (0.047) | 0.182 (0.031)\nSIFT [6] | Validation | 0.343 (0.027) | 0.329 (0.019) | 0.236 (0.031) | 0.257 (0.033)\nSIFT [6] | Testing | 0.351 (0.024) | 0.312 (0.015) | 0.302 (0.045) | 0.311 (0.039)\nAffine invariant/SIFT [17] | Training | 0.322 (0.018) | 0.280 (0.016) | 0.233 (0.042) | 0.244 (0.042)\nAffine invariant/SIFT [17] | Validation | 0.337 (0.015) | 0.298 (0.019) | 0.245 (0.028) | 0.229 (0.032)\nAffine invariant/SIFT [17] | Testing | 0.332 (0.024) | 0.339 (0.028) | 0.277 (0.035) | 0.231 (0.034)\n\n5 Conclusion\n\nWe have presented a model for near-isometric shape matching which is robust to typical additional\nvariations of the shape. 
This is achieved by performing structured learning in a graphical model that\nencodes features with several different types of invariances, so that we can directly learn a \u201ccom-\npound invariance\u201d instead of taking for granted the exclusive assumption of isometric invariance.\nOur experiments revealed that structured learning with a principled graphical model that encodes\nboth the rigid shape as well as non-isometric variations gives substantial improvements, while still\nmaintaining competitive performance in terms of running time.\nAcknowledgements: We thank Marconi Barbosa and James Petterson for proofreading. NICTA\nis funded by the Australian Government\u2019s Backing Australia\u2019s Ability initiative, and the Australian\nResearch Council\u2019s ICT Centre of Excellence program.\n\nReferences\n[1] Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. PAMI\n\n24 (2002) 509\u2013522\n\n[2] Mori, G., Belongie, S., Malik, J.: Shape contexts enable ef\ufb01cient retrieval of similar shapes. In: CVPR.\n\n(2001) 723\u2013730\n\n[3] Mori, G., Malik, J.: Estimating human body con\ufb01gurations using shape context matching. In: ECCV.\n\n(2002) 666\u2013680\n\n[4] Frome, A., Huber, D., Kolluri, R., Bulow, T., Malik, J.: Recognizing objects in range data using regional\n\npoint descriptors. In: ECCV. (2004)\n\n[5] Belongie, S., Malik, J.: Matching with shape contexts. In: CBAIVL00. (2000) 20\u201326\n[6] Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV. (1999) 1150\u20131157\n[7] Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. IJCV 61 (2005) 55\u201379\n[8] Felzenszwalb, P.F., Schwartz, J.D.: Hierarchical matching of deformable shapes. In: CVPR. (2007)\n[9] LeCun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recognition with invariance to\n\npose and lighting. 
CVPR (2004) 97\u2013104\n\n[10] Carmichael, O., Hebert, M.: Shape-based recognition of wiry objects. PAMI 26 (2004) 1537\u20131552\n[11] McAuley, J.J., Caetano, T.S., Barbosa, M.S.: Graph rigidity, cyclic belief propagation and point pattern\n\nmatching. PAMI 30 (2008) 2047\u20132054\n\n[12] Caetano, T., Cheng, L., Le, Q., Smola, A.: Learning graph matching. In: ICCV. (2007) 1\u20138\n[13] Canny, J.: A computational approach to edge detection. In: RCV. (1987) 184\u2013203\n[14] Smith, S.: A new class of corner \ufb01nder. In: BMVC. (1992) 139\u2013148\n[15] Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y.: Support vector machine learning for interdepen-\n\ndent and structured output spaces. In: ICML. (2004)\n\n[16] Teo, C., Le, Q., Smola, A., Vishwanathan, S.: A scalable modular convex solver for regularized risk\n\nminimization. In: KDD. (2007)\n\n[17] Mikolajczyk, K., Schmid, C.: Scale and af\ufb01ne invariant interest point detectors. 60 (2004) 63\u201386\n[18] Grif\ufb01n, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California\n\nInstitute of Technology (2007)\n\n8\n\n\f", "award": [], "sourceid": 58, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Julian", "family_name": "Mcauley", "institution": null}, {"given_name": "Tib\u00e9rio", "family_name": "Caetano", "institution": null}]}