{"title": "Kernel Latent SVM for Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 809, "page_last": 817, "abstract": "Latent SVMs (LSVMs) are a class of powerful tools that have been successfully applied to many applications in computer vision. However, a limitation of LSVMs is that they rely on linear models. For many computer vision tasks, linear models are suboptimal and nonlinear models learned with kernels typically perform much better. Therefore it is desirable to develop the kernel version of LSVM. In this paper, we propose kernel latent SVM (KLSVM) -- a new learning framework that combines latent SVMs and kernel methods. We develop an iterative training algorithm to learn the model parameters. We demonstrate the effectiveness of KLSVM using three different applications in visual recognition. Our KLSVM formulation is very general and can be applied to solve a wide range of applications in computer vision and machine learning.", "full_text": "Kernel Latent SVM for Visual Recognition\n\nWeilong Yang\n\nSchool of Computing Science\n\nSimon Fraser University\n\nwya16@sfu.ca\n\nYang Wang\n\nDepartment of Computer Science\n\nUniversity of Manitoba\n\nywang@cs.umanitoba.ca\n\nArash Vahdat\n\nSchool of Computing Science\n\nSimon Fraser University\n\navahdat@sfu.ca\n\nGreg Mori\n\nSchool of Computing Science\n\nSimon Fraser University\n\nmori@cs.sfu.ca\n\nAbstract\n\nLatent SVMs (LSVMs) are a class of powerful tools that have been successfully\napplied to many applications in computer vision. However, a limitation of LSVMs\nis that they rely on linear models. For many computer vision tasks, linear mod-\nels are suboptimal and nonlinear models learned with kernels typically perform\nmuch better. Therefore it is desirable to develop the kernel version of LSVM. In\nthis paper, we propose kernel latent SVM (KLSVM) \u2013 a new learning framework\nthat combines latent SVMs and kernel methods. 
We develop an iterative training algorithm to learn the model parameters. We demonstrate the effectiveness of KLSVM using three different applications in visual recognition. Our KLSVM formulation is very general and can be applied to solve a wide range of applications in computer vision and machine learning.

1 Introduction

We consider the problem of learning discriminative classification models for visual recognition. In particular, we are interested in models that have the following two characteristics: 1) they can be used on weakly labeled data; 2) they have nonlinear decision boundaries.

Linear classifiers are a popular class of learning methods in computer vision. In the case of binary classification, they are prediction models of the form $f(x) = w^\top x$, where $x$ is the feature vector and $w$ is a vector of model parameters^1. The classification decision is based on the value of $f(x)$. Linear classifiers are amenable to efficient and scalable learning/inference -- an important factor in many computer vision applications that involve high-dimensional features and large datasets. The person detection algorithm in [2] is an example of the success of linear classifiers in computer vision. The detector is trained by learning a linear support vector machine based on HOG descriptors of positive and negative examples. The model parameter $w$ in this detector can be thought of as a statistical template for HOG descriptors of persons.

The reliance on a rigid template $w$ is a major limitation of linear classifiers. As a result, the learned models usually cannot effectively capture all the variations (shape, appearance, pose, etc.) in natural images. For example, the detector in [2] usually only works well when a person is in an upright posture.

In the literature, there are two main approaches for addressing this limitation. The first one is to introduce latent variables into the linear model.
In computer vision, this is best exemplified by the success of deformable part models (DPM) [5] for object detection. DPM captures shape and pose variations of an object class with a root template covering the whole object and several part templates. By allowing these parts to deform from their ideal locations with respect to the root template, DPM provides more flexibility than a rigid template. Learning a DPM involves solving a latent SVM (LSVM) [5, 17] -- an extension of the regular linear SVM for handling latent variables. LSVM provides a general framework for handling "weakly labeled data" arising in many applications. For example, in object detection, the training data are weakly labeled because we are only given the bounding boxes of the objects without the detailed annotation for each part. In addition to modeling part deformation, another popular application of LSVM is to use it as a mixture model where the mixture component is represented as a latent variable [5, 6, 16].

^1 Without loss of generality, we assume linear models without the bias term.

The other main approach is to directly learn a nonlinear classifier. The kernel method [1] is a representative example along this line of work. A limitation of kernel methods is that learning is more expensive than for linear classifiers on large datasets, although efficient algorithms exist for certain types of kernels (e.g. the histogram intersection kernel (HIK) [10]). One possible way to address the computational issue is to use a nonlinear mapping to convert the original feature into some higher-dimensional space, and then apply linear classifiers in that space [14].

Latent SVMs and kernel methods represent two different, yet complementary approaches for learning classification models that are more expressive than linear classifiers. They both have their own advantages and limitations.
The advantage of LSVM is that it provides a general and elegant formulation for dealing with many weakly supervised problems in computer vision. The latent variables in LSVM often have intuitive and semantic meanings, so it is usually easy to adapt LSVM to capture various prior knowledge about the unobserved variables in various applications. Examples of latent variables in the literature include part locations in object detection [5], subcategories in video annotation [16], object localization in image classification [8], etc. However, LSVM is essentially a parametric model, so the capacity of these types of models is limited by their parametric form. In contrast, kernel methods are non-parametric models: the model complexity is implicitly determined by the number of support vectors. Since the number of support vectors can vary depending on the training data, kernel methods can adapt their model complexity to fit the data.

In this paper, we propose kernel latent SVM (KLSVM) -- a new learning framework that combines latent SVMs and kernel methods. As a result, KLSVM has the benefits of both approaches. On one hand, the latent variables in KLSVM can be intuitive and semantically meaningful. On the other hand, KLSVM is nonparametric in nature, since the decision boundary is defined implicitly by the support vectors. We demonstrate KLSVM on three applications in visual recognition: 1) object classification with latent localization; 2) object classification with latent subcategories; 3) recognition of object interactions.

2 Preliminaries

In this section, we introduce some background on latent SVM and on the dual form of SVMs used for deriving kernel SVMs. Our proposed model in Sec. 3 will build upon these two ideas.

Latent SVM: We assume a data instance is in the form of (x, h, y), where x is the observed variable and y is the class label.
Each instance is also associated with a latent variable h that captures some unobserved information about the data. For example, say we want to learn a "car" model from a set of positive images containing cars and a set of negative images without cars. We know there is a car somewhere in a positive image, but we do not know its exact location. In this case, h can be used to represent the unobserved location of the car in the image. In this paper, we consider binary classification for simplicity, i.e. $y \in \{+1, -1\}$. Multi-class classification can be easily converted to binary classification, e.g. using the one-vs-all or one-vs-one strategy. To simplify the notation, we also assume the latent variable h takes its value from a discrete set of labels $h \in \mathcal{H}$. However, our formulation is general; we will show how to deal with more complex h in Sec. 3.2 and in one of the experiments (Sec. 4.3).

In latent SVM, the scoring function of a sample x is defined as $f_w(x) = \max_h w^\top \phi(x, h)$, where $\phi(x, h)$ is the feature vector defined for the pair (x, h). For example, in the "car" model, $\phi(x, h)$ can be a feature vector extracted from the image patch at location h of the image x. The objective function of LSVM is defined as $L(w) = \frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i f_w(x_i))$. LSVM is essentially a non-convex optimization problem. However, the learning problem becomes convex once the latent variable h is fixed for the positive examples. Therefore, we can train the LSVM with an iterative algorithm that alternates between inferring h on the positive examples and optimizing the model parameter w.

Dual form with fixed h on positive examples: Due to its non-convexity, it is not straightforward to derive the dual form of the general LSVM. Therefore, as a starting point, we first consider a simpler scenario assuming h is fixed (or observed) on the positive training examples.
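The alternating training scheme described above can be made concrete with a small sketch. Everything below -- the windowed 1-D "images", the feature map phi, the subgradient solver, and all constants -- is illustrative, not from the paper:

```python
import numpy as np

WIN = 2  # window width of the toy "template"

def phi(x, h):
    """phi(x, h): the length-WIN window of the 1-D signal x at position h."""
    return x[h:h + WIN]

def score(w, x):
    """f_w(x) = max_h w . phi(x, h); returns (value, argmax h)."""
    vals = [w @ phi(x, h) for h in range(len(x) - WIN + 1)]
    h_star = int(np.argmax(vals))
    return vals[h_star], h_star

def train_lsvm(X, y, C=1.0, outer=10, inner=200, lr=0.01):
    """Alternating LSVM training: (1) fix w and infer h on the positives;
    (2) fix those h and take subgradient steps on the now-convex objective
    L(w) = 0.5 ||w||^2 + C sum_i max(0, 1 - y_i f_w(x_i))."""
    w = np.zeros(WIN)
    for _ in range(outer):
        H = [score(w, x)[1] for x in X]          # step 1: latent inference
        for _ in range(inner):                   # step 2: convex subproblem
            g = w.copy()                         # gradient of 0.5 ||w||^2
            for x, label, h in zip(X, y, H):
                # positives use the fixed h; negatives re-maximize over h,
                # since the max sits inside their hinge term
                feat = phi(x, h) if label > 0 else phi(x, score(w, x)[1])
                if 1 - label * (w @ feat) > 0:   # hinge active
                    g -= C * label * feat
            w -= lr * g
    return w

# Positives contain the pattern [1, 1] somewhere; negatives do not.
X = [np.array([0.0, 1, 1, 0, 0, 0]), np.array([0.0, 0, 0, 1, 1, 0]),
     np.array([0.0, 0, 0, 0, 0, 0]), np.array([1.0, -1, 1, -1, 1, -1])]
y = [1, 1, -1, -1]
w = train_lsvm(X, y)
```

On this toy data the learned template assigns a higher max-window score to every positive than to any negative.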
As previously mentioned, the LSVM is relaxed to a convex problem under this assumption. Note that we will remove this assumption in Sec. 3. In the above "car" example, this means that we have the ground-truth bounding boxes of the cars in each image. More formally, we are given $M$ positive samples $\{x_i, h_i\}_{i=1}^{M}$ and $N$ negative samples $\{x_j\}_{j=M+1}^{M+N}$. Inspired by linear SVMs, our goal is to find a linear discriminant $f_w(x, h) = w^\top \phi(x, h)$ by solving the following quadratic program:

  $P(w^*) = \min_{w, \xi} \frac{1}{2}\|w\|^2 + C_1 \sum_i \xi_i + C_2 \sum_{j,h} \xi_{j,h}$   (1a)
  s.t. $w^\top \phi(x_i, h_i) \ge 1 - \xi_i, \ \forall i \in \{1, 2, ..., M\}$   (1b)
       $-w^\top \phi(x_j, h) \ge 1 - \xi_{j,h}, \ \forall j \in \{M+1, M+2, ..., M+N\}, \forall h \in \mathcal{H}$   (1c)
       $\xi_i \ge 0, \ \xi_{j,h} \ge 0, \ \forall i, \forall j, \forall h \in \mathcal{H}$   (1d)

Similar to standard SVMs, $\{\xi_i\}$ and $\{\xi_{j,h}\}$ are slack variables for handling soft margins. It is interesting to note that the optimization problem in Eq. 1 is almost identical to that of standard linear SVMs. The only difference lies in the constraints on the negative training examples (Eq. 1c). Since we assume the h's are not observed on negative images, we need to enumerate all possible values of h in Eq. 1c. Intuitively, this means every image patch from a negative image (i.e. a non-car image) is not a car. It is easy to show that Eq. 1 is convex. Similar to the dual form of standard SVMs, we can derive the dual form of Eq. 1 as follows:

  $D(\alpha^*, \beta^*) = \max_{\alpha, \beta} \sum_i \alpha_i + \sum_j \sum_h \beta_{j,h} - \frac{1}{2} \big\| \sum_i \alpha_i \phi(x_i, h_i) - \sum_j \sum_h \beta_{j,h} \phi(x_j, h) \big\|^2$   (2a)
  s.t. $0 \le \alpha_i \le C_1, \ \forall i; \quad 0 \le \beta_{j,h} \le C_2, \ \forall j, \forall h \in \mathcal{H}$   (2b)

The optimal primal parameters $w^*$ of Eq. 1 and the optimal dual parameters $(\alpha^*, \beta^*)$ of Eq. 2 are related as follows:

  $w^* = \sum_i \alpha^*_i \phi(x_i, h_i) - \sum_j \sum_h \beta^*_{j,h} \phi(x_j, h)$   (3)

Let us define $\lambda$ to be the concatenation of $\{\alpha_i : \forall i\}$ and $\{\beta_{j,h} : \forall j, \forall h \in \mathcal{H}\}$, so $|\lambda| = M + N \times |\mathcal{H}|$. Let $\Psi$ be a $|\lambda| \times D$ matrix, where $D$ is the dimension of $\phi(x, h)$; $\Psi$ is obtained by stacking together $\{\phi(x_i, h_i) : \forall i\}$ and $\{-\phi(x_j, h) : \forall j, \forall h \in \mathcal{H}\}$. We also define $Q = \Psi \Psi^\top$ and $\mathbf{1}$ to be a vector of all 1's. Then Eq. 2a can be rewritten as (we omit the linear constraints on $\lambda$ for simplicity):

  $\max_{\lambda} \ \lambda^\top \mathbf{1} - \frac{1}{2} \lambda^\top Q \lambda$   (4)

The advantage of working with the dual form in Eq. 4 is that it only involves the so-called kernel matrix $Q$. Each entry of $Q$ is a dot product of two vectors of the form $\phi(x, h)^\top \phi(x', h')$. We can replace this dot product with any other kernel function of the form $k(\phi(x, h), \phi(x', h'))$ to get nonlinear classifiers [1]. The scoring function for a testing image $x_{new}$ can be kernelized as follows: $f(x_{new}) = \max_{h_{new}} \big( \sum_i \alpha^*_i k(\phi(x_i, h_i), \phi(x_{new}, h_{new})) - \sum_j \sum_h \beta^*_{j,h} k(\phi(x_j, h), \phi(x_{new}, h_{new})) \big)$.

Another important, yet often overlooked, fact is that the optimal values of the two quadratic programs in Eqs. 1 and 2 have a specific meaning: they correspond to the inverse of the (soft) margin of the resultant SVM classifier [9, 15], i.e. $P(w^*) = D(\alpha^*, \beta^*) = \frac{1}{\text{SVM margin}}$. In the next section, we will exploit this fact to develop kernel latent support vector machines.

3 Kernel Latent SVM

Now we assume the variables $\{h_i\}_{i=1}^{M}$ on the positive training examples are unobserved. If the scoring function used for classification is of the form $f(x) = \max_h w^\top \phi(x, h)$, we can use the LSVM formulation [5, 17] to learn the model parameters $w$. As mentioned earlier, the limitation of LSVM is the linearity assumption on $w^\top \phi(x, h)$. In this section, we propose kernel latent SVM (KLSVM) -- a new latent variable learning method that only requires a kernel function $K(x, h; x', h')$ between a pair $(x, h)$ and $(x', h')$.

Note that when $\{h_i\}_{i=1}^{M}$ are observed on the positive training examples, we can plug them into Eq. 2 to learn a nonlinear kernelized decision function that separates the positive and negative examples. When $\{h_i\}_{i=1}^{M}$ are latent, an intuitive thing to do is to find the labeling of $\{h_i\}_{i=1}^{M}$ such that, when we plug them in and solve Eq. 2, the resultant nonlinear decision function separates the two classes as widely as possible. In other words, we look for the set $\{h^*_i\}$ that maximizes the SVM margin (equivalently, minimizes $D(\alpha^*, \beta^*, \{h_i\})$). The same intuition was previously used to develop the max-margin clustering method in [15]. Using this intuition, we write the optimal value of the dual form as $D(\alpha^*, \beta^*, \{h_i\})$, since it now implicitly depends on the labelings $\{h_i\}$.
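As an illustration, the boxed quadratic program in Eq. 4 can be attacked with a simple projected-gradient sketch. For compactness it uses a single box bound C in place of the separate C1/C2, and the toy features are made up:

```python
import numpy as np

def solve_dual(Q, C, steps=2000, lr=0.01):
    """Projected gradient ascent on the QP of Eq. 4:
    max_lam  lam.1 - 0.5 * lam' Q lam   s.t.  0 <= lam <= C.
    (One shared box bound C instead of the paper's separate C1/C2.)"""
    lam = np.zeros(Q.shape[0])
    for _ in range(steps):
        grad = 1.0 - Q @ lam                  # gradient of the dual objective
        lam = np.clip(lam + lr * grad, 0.0, C)
    return lam

# Toy problem: one positive with its fixed h, one negative with a single h.
phi_pos = np.array([1.0, 0.0])                # phi(x_1, h_1)
phi_neg = np.array([-1.0, 0.0])               # phi(x_2, h)
Psi = np.vstack([phi_pos, -phi_neg])          # rows: phi(x_i, h_i) and -phi(x_j, h)
Q = Psi @ Psi.T                               # "kernel matrix" (linear kernel here)
lam = solve_dual(Q, C=10.0)
w = Psi.T @ lam                               # Eq. 3: recover the primal weights
```

On this separable toy problem the recovered w satisfies both margin constraints, and the primal value 0.5 ||w||^2 matches the dual value, as claimed at the end of Sec. 2.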
We can jointly find the labelings $\{h_i\}$ and solve for $(\alpha^*, \beta^*)$ with the following optimization problem:

  $\min_{\{h_i\}} D(\alpha^*, \beta^*, \{h_i\}) = \min_{\{h_i\}} \max_{\alpha, \beta} \sum_i \alpha_i + \sum_j \sum_h \beta_{j,h} - \frac{1}{2} \big\| \sum_i \alpha_i \phi(x_i, h_i) - \sum_j \sum_h \beta_{j,h} \phi(x_j, h) \big\|^2$   (5a, 5b)
  s.t. $0 \le \alpha_i \le C_1, \ \forall i; \quad 0 \le \beta_{j,h} \le C_2, \ \forall j, \forall h \in \mathcal{H}$   (5c)

The most straightforward way of solving Eq. 5 is to optimize $D(\alpha^*, \beta^*, \{h_i\})$ for every possible combination of values of $\{h_i\}$ and then take the minimum. When $h_i$ takes its value from a discrete set of $K$ possible choices (i.e. $|\mathcal{H}| = K$), this naive approach needs to solve $K^M$ quadratic programs, which is obviously too expensive. Instead, we use the following iterative algorithm:

• Fix $\alpha$ and $\beta$, and compute the optimal $\{h_i\}^*$ by

  $\{h_i\}^* = \arg\max_{\{h_i\}} \frac{1}{2} \big\| \sum_i \alpha_i \phi(x_i, h_i) - \sum_j \sum_h \beta_{j,h} \phi(x_j, h) \big\|^2$   (6)

• Fix $\{h_i\}$, and compute the optimal $(\alpha^*, \beta^*)$ by

  $(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta} \Big\{ \sum_i \alpha_i + \sum_j \sum_h \beta_{j,h} - \frac{1}{2} \big\| \sum_i \alpha_i \phi(x_i, h_i) - \sum_j \sum_h \beta_{j,h} \phi(x_j, h) \big\|^2 \Big\}$   (7)

The optimization problem in Eq. 7 is a quadratic program similar to that of a standard dual SVM. As a result, Eq. 7 can be kernelized as in Eq. 4 and solved using a standard dual solver for regular SVMs. In Sec. 3.1, we describe how to kernelize and solve the optimization problem in Eq. 6.

3.1 Optimization over $\{h_i\}$

The complexity of a simple enumeration approach for solving Eq. 6 is again $O(K^M)$, which is clearly too expensive for practical purposes. Instead, we solve it iteratively using an algorithm similar to coordinate ascent. Within an iteration, we choose one positive training example $t$ and update $h_t$ while fixing $h_i$ for all $i \ne t$. The optimal $h^*_t$ can be computed as follows:

  $h^*_t = \arg\max_{h_t} \big\| \alpha_t \phi(x_t, h_t) + \sum_{i : i \ne t} \alpha_i \phi(x_i, h_i) - \sum_j \sum_h \beta_{j,h} \phi(x_j, h) \big\|^2$   (8a)
  $\Leftrightarrow \arg\max_{h_t} \|\alpha_t \phi(x_t, h_t)\|^2 + 2 \big( \sum_{i : i \ne t} \alpha_i \phi(x_i, h_i) - \sum_j \sum_h \beta_{j,h} \phi(x_j, h) \big)^\top \alpha_t \phi(x_t, h_t)$   (8b)

By replacing the dot product $\phi(x, h)^\top \phi(x', h')$ with a kernel function $k(\phi(x, h), \phi(x', h'))$, we obtain the kernelized version of Eq. 8b:

  $h^*_t = \arg\max_{h_t} \ \alpha_t^2 \, k(\phi(x_t, h_t), \phi(x_t, h_t)) + 2 \sum_{i : i \ne t} \alpha_i \alpha_t \, k(\phi(x_i, h_i), \phi(x_t, h_t)) - 2 \sum_j \sum_h \beta_{j,h} \alpha_t \, k(\phi(x_j, h), \phi(x_t, h_t))$   (9)

It is interesting to notice that if the $t$-th example is not a support vector (i.e. $\alpha_t = 0$), the value of Eq. 9 is zero regardless of $h_t$. This means that in KLSVM we can improve the training efficiency by performing Eq. 9 only on the positive examples corresponding to support vectors. For the other positive examples (non-support vectors), we simply keep their latent variables from the previous iteration. Note that in LSVM, the inference during training needs to be performed on every positive example.

Connection to LSVM: When a linear kernel is used, the inference problem (Eq. 8) has a very interesting connection to the LSVM in [5].
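A minimal sketch of this coordinate-ascent update (Eq. 9), including the support-vector shortcut; the data layout (dicts/lists of phi vectors) is a choice of the sketch, not the paper's:

```python
import numpy as np

def update_latent(t, alpha, beta, H, pos_feats, neg_feats, k, candidates):
    """One coordinate-ascent step (Eq. 9): re-infer h_t for positive example t
    with all other latent variables and all dual variables held fixed.
    pos_feats[i][h] and neg_feats[j][h] store phi(., h); k is the kernel."""
    if alpha[t] == 0.0:
        # Non-support vectors keep their previous h (efficiency trick, Sec. 3.1).
        return H[t]
    best_h, best_val = H[t], -np.inf
    for h_t in candidates:
        f_t = pos_feats[t][h_t]
        val = alpha[t] ** 2 * k(f_t, f_t)
        val += 2 * alpha[t] * sum(alpha[i] * k(pos_feats[i][H[i]], f_t)
                                  for i in range(len(alpha)) if i != t)
        val -= 2 * alpha[t] * sum(beta[j][h] * k(neg_feats[j][h], f_t)
                                  for j in range(len(neg_feats))
                                  for h in range(len(neg_feats[j])))
        if val > best_val:
            best_val, best_h = val, h_t
    return best_h
```

With a linear kernel, the update prefers the candidate h_t that agrees with the other positive support vectors and disagrees with the negative ones, exactly the trade-off expressed by the three terms of Eq. 9.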
Recall that for linear kernels, the model parameters $w$ and the dual variables $(\alpha, \beta)$ are related by Eq. 3. Then Eq. 8 becomes:

  $h^*_t = \arg\max_{h_t} \|\alpha_t \phi(x_t, h_t)\|^2 + 2 \big( w - \alpha_t \phi(x_t, h_t^{old}) \big)^\top \alpha_t \phi(x_t, h_t)$   (10a)
  $\Leftrightarrow \arg\max_{h_t} \ \alpha_t w^\top \phi(x_t, h_t) + \frac{1}{2} \alpha_t^2 \|\phi(x_t, h_t)\|^2 - \alpha_t^2 \phi(x_t, h_t^{old})^\top \phi(x_t, h_t)$   (10b)

where $h_t^{old}$ is the value of the latent variable of the $t$-th example in the previous iteration. Let us consider the situation when $\alpha_t \ne 0$ and the feature vector $\phi(x, h)$ is l2-normalized, which is common in computer vision. In this case, $\alpha_t^2 \phi(x_t, h_t)^\top \phi(x_t, h_t)$ is a constant, and we have $\phi(x_t, h_t^{old})^\top \phi(x_t, h_t^{old}) > \phi(x_t, h_t^{old})^\top \phi(x_t, h_t)$ if $h_t \ne h_t^{old}$. Then Eq. 10 is equivalent to:

  $h^*_t = \arg\max_{h_t} \ w^\top \phi(x_t, h_t) - \alpha_t \phi(x_t, h_t^{old})^\top \phi(x_t, h_t)$   (11)

Eq. 11 is very similar to the inference problem in LSVM, i.e. $h^*_t = \arg\max_{h_t} w^\top \phi(x_t, h_t)$, but with an extra term $\alpha_t \phi(x_t, h_t^{old})^\top \phi(x_t, h_t)$ that penalizes the choice of $h_t$ for taking the same value as in the previous iteration, $h_t^{old}$. This has a very appealing intuitive interpretation. If the $t$-th positive example is a support vector, the latent variable $h_t^{old}$ from the previous iteration causes this example to lie very close to (or even on the wrong side of) the decision boundary, i.e. the example is not well separated. During the current iteration, the second term in Eq. 11 penalizes $h_t^{old}$ from being chosen again, since we already know the example will not be well separated if we choose $h_t^{old}$ again. The amount of penalty depends on the magnitudes of $\alpha_t$ and $\phi(x_t, h_t^{old})^\top \phi(x_t, h_t)$. We can interpret $\alpha_t$ as measuring how "bad" $h_t^{old}$ is, and $\phi(x_t, h_t^{old})^\top \phi(x_t, h_t)$ as measuring how close $h_t$ is to $h_t^{old}$. In other words, Eq. 11 penalizes the new $h^*_t$ for being "close" to the "bad" $h_t^{old}$.

3.2 Composite Kernels

So far we have assumed that the latent variable h takes its value from a discrete set of labels. Given a pair of (x, h) and (x', h'), the types of kernel function $k(x, h; x', h')$ we can choose from are still limited to a handful of standard kernels (e.g. Gaussian, RBF, HIK, etc.). In this section, we consider more interesting cases where h involves some complex structure. This gives us two important benefits. First, it allows us to exploit structural information in the latent variables, in analogy to structured output learning (e.g. [12, 13]). More importantly, it gives us more flexibility to construct new kernel functions by composing simple kernels.

Before we proceed, let us first motivate the composite kernel with an example application. Suppose we want to detect some complex person-object interaction (e.g. "person riding a bike") in an image. One possible solution is to detect persons and bikes in the image, and then combine the results by taking into account their relationship (i.e. "riding"). Imagine we already have kernel functions corresponding to some components (e.g. person, bike) of the interaction. In the following, we will show how to compose a new kernel for the "person riding a bike" classifier from those components.

We denote the latent variable by $\vec{h}$ to emphasize that it is now a vector instead of a single discrete value. We write $\vec{h} = (z_1, z_2, ...)$, where $z_u$ is the $u$-th component of $\vec{h}$ and takes its value from a discrete set of possible labels.
For the structured latent variable, it is assumed that there are certain dependencies between some pairs $(z_u, z_v)$. We can use an undirected graph $G = (\mathcal{V}, \mathcal{E})$ to capture the structure of the latent variable, where a vertex $u \in \mathcal{V}$ corresponds to the label $z_u$, and an edge $(u, v) \in \mathcal{E}$ corresponds to the dependency between $z_u$ and $z_v$. As a concrete example, consider the "person riding a bike" recognition problem. The latent variable in this case has two components $\vec{h} = (z_{person}, z_{bike})$, corresponding to the locations of the person and the bike, respectively. On the training data, we have access to the ground-truth bounding box of "person riding a bike" as a whole, but not the exact location of the "person" or the "bike" within the bounding box, so $\vec{h}$ is latent in this application. The edge connecting $z_{person}$ and $z_{bike}$ captures the relationship (e.g. "riding on", "next to", etc.) between the two objects.

Suppose we already have kernel functions corresponding to the vertices and edges in the graph. We can then define the composite kernel as the summation of the kernels over all the vertices and edges:

  $K(\Phi(x, \vec{h}), \Phi(x', \vec{h}')) = \sum_{u \in \mathcal{V}} k_u(\phi(x, z_u), \phi(x', z'_u)) + \sum_{(u,v) \in \mathcal{E}} k_{uv}(\psi(x, z_u, z_v), \psi(x', z'_u, z'_v))$   (12)

When the latent variable $\vec{h}$ forms a tree structure, efficient inference algorithms exist for solving Eq. 9, such as dynamic programming. It is also possible for Eq. 12 to include kernels defined on higher-order cliques of the graph, as long as we have some pre-defined kernel functions for them.

Figure 1: Visualization of how the latent variable (i.e. object location) changes during learning. The red bounding box corresponds to the initial object location; the blue bounding box corresponds to the object location after learning.

Table 1: Results on the mammal dataset. We show the mean/std of classification accuracies over five rounds of experiments.
  BOF + linear SVM:  45.57 ± 4.23
  BOF + kernel SVM:  50.53 ± 6.53
  linear LSVM:       75.07 ± 4.18
  KLSVM:             84.49 ± 3.63

4 Experiments

We evaluate KLSVM in three different applications of visual recognition. Each application has a different type of latent variable. For these applications, we will show that KLSVM outperforms both the linear LSVM [5] and the regular kernel SVM. Note that we implement the learning of the linear LSVM ourselves, using the same iterative algorithm as the one in [5].

4.1 Object Classification with Latent Localization

Problem and Dataset: We consider object classification with image-level supervision. Our training data only have image-level labels indicating the presence/absence of each object category in an image. The exact object location in the image is not provided and is treated as the latent variable h in our formulation. We define the feature vector $\phi(x, h)$ as the HOG feature extracted from the image at location h. During testing, the inference of h is performed by enumerating all possible locations in the image.

We evaluate our algorithm on the mammal dataset [8], which consists of 6 mammal categories with about 45 images per category. For each category, we use half of the images for training and the remaining half for testing.
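As a concrete illustration of the composite kernel in Eq. 12 (Sec. 3.2), here is a minimal sketch over an explicit vertex/edge list; the feature-map signatures are assumptions of the sketch:

```python
import numpy as np

def hik(a, b):
    """Histogram intersection kernel."""
    return float(np.minimum(a, b).sum())

def composite_kernel(x, h, xp, hp, vertex_phi, edge_psi, edges=(),
                     vertex_k=hik, edge_k=hik):
    """Eq. 12: sum of per-vertex kernels plus per-edge kernels on graph G.
    vertex_phi(x, z_u) and edge_psi(x, z_u, z_v) are caller-supplied feature
    maps; their signatures here are a choice of this sketch."""
    val = sum(vertex_k(vertex_phi(x, h[u]), vertex_phi(xp, hp[u]))
              for u in range(len(h)))
    val += sum(edge_k(edge_psi(x, h[u], h[v]), edge_psi(xp, hp[u], hp[v]))
               for (u, v) in edges)
    return val
```

Any positive-definite kernel can be plugged in per vertex or per edge, since a sum of kernels is again a kernel.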
We assume the object size is the same for images of the same category, which is a reasonable assumption for this dataset. This dataset was used to evaluate the linear LSVM in [8].

Results: We compare our algorithm with the linear LSVM. To demonstrate the benefit of using latent variables, we also compare with two simple baselines using linear and kernel SVMs based on bag-of-features (BOF) extracted from the whole image (i.e. without latent variables). For both baselines, we aggregate the quantized HOG features densely sampled from the whole image. The features are then fed into a standard linear SVM and a kernel SVM, respectively. We use the histogram intersection kernel (HIK) [10], since it has proved successful for vision applications and efficient learning/inference algorithms exist for it.

We run the experiments for five rounds. In each round, we randomly split the images from each category into training and testing sets. For both the linear LSVM and KLSVM, we initialize the latent variable at the center location of each image and set C1 = C2 = 1. For both algorithms, we use a one-versus-one classification scheme, and we use the HIK kernel in the KLSVM. Table 1 summarizes the mean and standard deviations of the classification accuracies over the five rounds. Across all experiments, both the linear LSVM and KLSVM achieve significantly better results than the approaches using BOF features from the whole image. This is intuitively reasonable, since most images in this dataset share very similar scenes, so BOF features without latent variables cannot capture the subtle differences between the categories. Table 1 also shows that KLSVM significantly outperforms the linear LSVM.

Fig. 1 shows examples of how the latent variables change on some training images during the learning of the KLSVM. For each training image, the location of the object (latent variable h) is initialized to the center of the image.
After the learning algorithm terminates, the latent variables accurately locate the objects.

Figure 2: Visualization of some testing examples from the "bird" (left) and "boat" (right) categories. Each row corresponds to a subcategory. Visually similar images are grouped into the same subcategory.

Table 2: Results on the CIFAR10 dataset. We show the mean/std of classification accuracies over five folds of experiments. Each fold uses a different batch of the training data.
  non-latent linear SVM:  50.69 ± 0.38
  linear LSVM:            53.13 ± 0.63
  non-latent kernel SVM:  52.98 ± 0.22
  KLSVM:                  55.17 ± 0.27

4.2 Object Classification with Latent Subcategories

Problem and Dataset: Our second application is also object classification, but here we consider a different type of latent variable. Objects within a category usually exhibit large intra-class variations. For example, consider the images of the "bird" category shown in the left column of Fig. 2. Even though they are examples of the same category, they still exhibit very large appearance variations, and it is usually very difficult to learn a single "bird" model that captures all of those variations. One way to handle the intra-class variation is to split the "bird" category into several subcategories; examples within a subcategory will be more visually similar than examples across subcategories. Here we use the latent variable h to indicate the subcategory an image belongs to. If a training image belongs to class c, its subcategory label h takes its value from a set $\mathcal{H}_c$ of subcategory labels corresponding to the c-th class.
Note that subcategories are latent on the training data, so they may or may not have semantic meanings.

The feature vector $\phi(x, h)$ is defined as a sparse vector whose dimension is $|\mathcal{H}_c|$ times the dimension of $\phi(x)$, where $\phi(x)$ is the HOG descriptor extracted from the image x. In the experiments, we set $|\mathcal{H}_c| = 3$ for all c. We can then define $\phi(x, h=1) = (\phi(x); 0; 0)$, $\phi(x, h=2) = (0; \phi(x); 0)$, and so on. Similar models have been proposed to address viewpoint changes in object detection [6] and semantic variations in YouTube video tagging [16].

We use the CIFAR10 dataset [7] in our experiments. It consists of images from ten classes, including airplane, automobile, bird, cat, etc. The training set has been divided into five batches, each containing 10000 images. There are in total 10000 test images.

Results: Again we compare with three baselines: linear LSVM, non-latent linear SVM, and non-latent kernel SVM. As before, we use the HIK kernel for the kernel-based methods. For the non-latent approaches, we simply feed the feature vector $\phi(x)$ to the SVMs without using any latent variable.

We run the experiments in five folds. Each fold uses a different training batch but the same testing batch. We set C1 = C2 = 0.01 for all the experiments and initialize the subcategory labels of the training images by k-means clustering. Table 2 summarizes the results. Again, KLSVM outperforms the other baseline approaches. It is interesting to note that both the linear LSVM and KLSVM outperform their non-latent counterparts, which demonstrates the effectiveness of using latent subcategories in object classification. We visualize examples of correctly classified testing images from the "bird" and "boat" categories in Fig. 2.
Images on the same row are assigned the same subcategory labels. We can see that visually similar images are automatically grouped into the same subcategory.

4.3 Recognition of Object Interaction
Problem and Dataset: Finally, we consider an application where the latent variable is more complex and requires the composite kernel introduced in Sec. 3.2. We would like to recognize complex interactions between two objects (also called "visual phrases" [11]) in static images. We build a dataset consisting of four object interaction classes, i.e. "person riding a bicycle", "person next to a bicycle", "person next to a car" and "bicycle next to a car", based on the visual phrase dataset in [11]. Each class contains 86∼116 images. Each image is associated with only one of the four object interaction labels. There is no ground-truth bounding box information for each object. We use 40 images from each class for training and the rest for testing.
Our approach: We treat the locations of objects as latent variables. For example, when learning the model for "person riding a bicycle", we treat the locations of "person" and "bicycle" as latent variables. In this example, each image is associated with latent variables h⃗ = (z1, z2), where z1 denotes the location of the "person" and z2 denotes the location of the "bicycle". To reduce the search space of inference, we first apply off-the-shelf "person" and "bicycle" detectors [5] on

Method              Acc (%)
BOF + linear SVM    42.92
BOF + kernel SVM    58.46
linear LSVM         46.33 ± 1.4
KLSVM               66.42 ± 0.99

Table 3: Results on the object interaction dataset. For the approaches using latent variables, we show the mean/std of classification accuracies over five folds of experiments.

Figure 3: Visualization of how latent variables (i.e.
object locations) change during the learning. The left image is from the "person riding a bicycle" category, and the right image is from the "person next to a car" category. The yellow bounding boxes correspond to the initial object locations. The blue bounding boxes correspond to the object locations after the learning.

each image. For each object, we generate five candidate bounding boxes, which form a set Zi, i.e. |Z1| = |Z2| = 5 and zi ∈ Zi. Then, the inference of h⃗ is performed by enumerating the 25 combinations of z1 and z2. We also assume there are certain dependencies within the pair (z1, z2). Then the kernel between two images can be defined as follows:

K(Φ(x, h⃗), Φ(x′, h⃗′)) = Σ_{u∈{1,2}} ku(φ(x, zu), φ(x′, z′u)) + kp(ψ(z1, z2), ψ(z′1, z′2))    (13)

We define φ(x, zu) as the bag-of-features (BOF) extracted from the bounding box zu in the image x. For each bounding box, we split the region uniformly into four equal quadrants. Then we compute the bag-of-features for each quadrant by aggregating quantized HOG features. The final feature vector is the concatenation of these four bag-of-features histograms. This feature representation is similar to the spatial pyramid representation. In our experiment, we choose HIK for ku(·). The kernel kp(·) captures the spatial relationship between z1 and z2, such as above, below, overlapping, next-to, near, and far. Here ψ(z1, z2) is a sparse binary vector whose k-th element is set to 1 if the corresponding k-th relation is satisfied between bounding boxes z1 and z2. Note that kp(·) does not depend on the images. A similar representation has been used in [4].
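The composite kernel in Eq. (13), with the histogram intersection kernel (HIK) for ku(·) and a linear kernel for kp(·), can be sketched as follows. The histogram sizes and relation vectors below are toy placeholders, not the actual BOF features or six spatial relations from the experiments.

```python
import numpy as np

def hik(a, b):
    """Histogram intersection kernel: sum of elementwise minima."""
    return np.minimum(a, b).sum()

def composite_kernel(phi1, phi2, psi1, psi2):
    """Kernel of Eq. (13): HIK on each of the two per-object BOF
    histograms, plus a linear kernel on the binary relation vectors."""
    unary = sum(hik(phi1[u], phi2[u]) for u in range(2))
    pairwise = float(np.dot(psi1, psi2))
    return unary + pairwise

# Toy example: 3-bin histograms per object, 6 spatial relations.
phi_a = [np.array([2.0, 1.0, 0.0]), np.array([0.0, 3.0, 1.0])]
phi_b = [np.array([1.0, 1.0, 1.0]), np.array([1.0, 2.0, 2.0])]
psi_a = np.array([1, 0, 0, 1, 0, 0])  # relations satisfied in image a
psi_b = np.array([1, 0, 0, 0, 1, 0])  # relations satisfied in image b
print(composite_kernel(phi_a, phi_b, psi_a, psi_b))  # prints 6.0
```

During inference, this kernel would be evaluated under each of the |Z1| × |Z2| = 25 candidate assignments of (z1, z2), keeping the assignment that maximizes the learned scoring function.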
We define kp(·) as a simple linear kernel.
Results: We compare with the simple BOF + linear SVM and BOF + kernel SVM approaches. These two baselines use the same BOF feature representation as our approach, except that the features are extracted from the whole image. We choose the HIK in the kernel SVM. Note that this is a strong baseline, since [3] has shown that a similar pyramid feature representation with a kernel SVM achieves top performance on the task of person-object interaction recognition. The other baseline is the standard linear LSVM, in which we build the feature vector φ(x, h) by simply concatenating both unary features and pairwise features, i.e. φ(x, h) = [φ(x, z1); φ(x, z2); ψ(z1, z2)]. Again, we set C1 = C2 = 1 for all experiments. We run the experiments for five rounds for the approaches using latent variables. In each round, we randomly initialize the choices of z1 and z2. Table 3 summarizes the results. The kernel latent SVM that uses HIK for ku(·) achieves the best performance.
Fig. 3 shows examples of how the latent variables change on some training images during the learning of the KLSVM. For each training image, both latent variables z1 and z2 are randomly initialized to one of the five candidate bounding boxes. As we can see, the initial bounding boxes can accurately locate the target objects, but their spatial relations differ from the ground-truth labels. After the learning algorithm terminates, the latent variables not only locate the target objects, but more importantly they also capture the correct spatial relationship between the objects.

5 Conclusion
We have proposed kernel latent SVM – a new learning framework that combines the benefits of LSVM and kernel methods. Our learning framework is very general. The latent variables need not be a single discrete value; they can also take more complex values with interdependent structures.
Our experimental results on three different applications in visual recognition demonstrate that KLSVM outperforms using either LSVM or kernel methods alone. We believe our work will open the possibility of constructing more powerful and expressive prediction models for visual recognition.
Acknowledgement: This work was supported by a Google Research Award and NSERC. Yang Wang was partially supported by an NSERC postdoctoral fellowship.

References
[1] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[3] V. Delaitre, I. Laptev, and J. Sivic. Recognizing human actions in still images: a study of bag-of-features and part-based representations. In British Machine Vision Conference, 2010.
[4] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In IEEE International Conference on Computer Vision, 2009.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[6] C. Gu and X. Ren. Discriminative mixture-of-templates for viewpoint classification. In European Conference on Computer Vision, 2010.
[7] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
[8] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 2010.
[9] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming.
Journal of Machine Learning Research, 5:27–72, 2004.
[10] S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[11] M. A. Sadeghi and A. Farhadi. Recognition using visual phrases. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011.
[12] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004.
[13] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
[14] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 2012.
[15] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 17, pages 1537–1544. MIT Press, Cambridge, MA, 2005.
[16] W. Yang and G. Toderici. Discriminative tag learning on YouTube videos with latent sub-tags. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011.
[17] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning, 2009.