{"title": "Bilevel Distance Metric Learning for Robust Image Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 4198, "page_last": 4207, "abstract": "Metric learning, aiming to learn a discriminative Mahalanobis distance matrix M that can effectively reflect the similarity between data samples, has been widely studied in various image recognition problems. Most of the existing metric learning methods input the features extracted directly from the original data in the preprocess phase. What's worse, these features usually take no consideration of the local geometrical structure of the data and the noise that exists in the data, thus they may not be optimal for the subsequent metric learning task. In this paper, we integrate both feature extraction and metric learning into one joint optimization framework and propose a new bilevel distance metric learning model. Specifically, the lower level characterizes the intrinsic data structure using graph regularized sparse coefficients, while the upper level forces the data samples from the same class to be close to each other and pushes those from different classes far away. In addition, leveraging the KKT conditions and the alternating direction method (ADM), we derive an efficient algorithm to solve the proposed new model. 
Extensive experiments on various occluded datasets demonstrate the effectiveness and robustness of our method.", "full_text": "Bilevel Distance Metric Learning for Robust Image Recognition\n\nJie Xu1,2, Lei Luo2, Cheng Deng1,\u2217, Heng Huang2,3,\u2217\n\n1 School of Electronic Engineering, Xidian University, Xi\u2019an, Shaanxi, China\n\n2 Electrical and Computer Engineering, University of Pittsburgh, USA\n\n3 JDDGlobal.com\n\njie.xu@pitt.edu, leiluo2017@pitt.edu, chdeng.xd@gmail.com, heng.huang@pitt.edu\n\nAbstract\n\nMetric learning, aiming to learn a discriminative Mahalanobis distance matrix M that can effectively reflect the similarity between data samples, has been widely studied in various image recognition problems. Most of the existing metric learning methods input the features extracted directly from the original data in the preprocess phase. What\u2019s worse, these features usually take no consideration of the local geometrical structure of the data and the noise that exists in the data, thus they may not be optimal for the subsequent metric learning task. In this paper, we integrate both feature extraction and metric learning into one joint optimization framework and propose a new bilevel distance metric learning model. Specifically, the lower level characterizes the intrinsic data structure using graph regularized sparse coefficients, while the upper level forces the data samples from the same class to be close to each other and pushes those from different classes far away. In addition, leveraging the KKT conditions and the alternating direction method (ADM), we derive an efficient algorithm to solve the proposed new model. 
Extensive experiments on various occluded datasets demonstrate the effectiveness and robustness of our method.\n\n\u2217J.X. and L.L. made equal contributions. C.D. and H.H. are corresponding authors.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nThe metric learning problem is concerned with learning an optimal distance matrix M that captures the important relationships among data for a given task, i.e., assigning smaller distances between similar items and larger distances between dissimilar items. Generally, metric learning can be formulated as minimizing an objective function of the form \u00b5Reg(M) + Loss(M, A), where Reg(M) is a regularization term on the distance matrix M and Loss(M, A) is a loss function that penalizes violated constraints. Different choices of regularization terms and constraints result in various metric learning methods, e.g., large-margin nearest neighbor (LMNN) [16], information-theoretic metric learning (ITML) [4], FANTOPE [7], CAP [5], etc. More recent works use the maximum correntropy criterion [18], the smoothed Wasserstein distance [19], and the matrix variate Gaussian mixture distribution [11] in metric learning formulations to improve robustness. Although these methods achieve great success, they mainly focus on improving the discriminability of the distance matrix M and ignore the discriminating power of the input features. In particular, the descriptors of the sample pairs they address are usually extracted directly from the original data in the preprocessing phase without considering the local geometrical structure of the data, so such descriptors may not be optimal for the subsequent metric learning task.\n\nBesides metric learning methods, many other machine learning tasks such as clustering and dictionary learning also suffer from the above limitation. To address this issue, recent work adopts the strategy of joint learning or a bilevel model, and many researchers have achieved strong results with it. Wang et al. [15] propose a joint optimization framework in terms of both feature extraction and discriminative clustering. They utilize graph regularized sparse codes as the features, and formulate sparse coding as the constraint for clustering. Zhou et al. [23] present a novel bilevel model-based discriminative dictionary learning method for recognition tasks. The upper level directly minimizes the classification error, while the lower level uses the sparsity term and the Laplacian term to characterize the intrinsic data structure. Yang et al. [20] propose a bilevel sparse coding model for coupled feature spaces, where they aim to learn dictionaries for sparse modeling in both spaces while enforcing some desired relationships between the two signal spaces. All these models benefit from the joint learning strategy or the bilevel model and achieve an overall optimality to a great extent. Inspired by these works, we propose to extract features and learn the Mahalanobis distance matrix M through a unified joint optimization model.\n\nHow to choose the feature extraction model is also an important problem. Although the metric learning task aims to learn a discriminative M, it is better if the input features are also discriminative. The common choice is principal component analysis (PCA) features, which reduce the data dimension and identify the most important features [1]. However, PCA features are not necessarily discriminative and may also lose useful information. 
More recently, sparse coefficients have proved to be an effective feature that is not only robust to noise but also scalable to high dimensional data [17]. Furthermore, motivated by recent progress in manifold learning, Zheng et al. [22] incorporate the graph Laplacian into the sparse coding objective function as a regularizer, achieving more discriminating power compared with traditional sparse coding algorithms.\n\nIn this paper, we integrate the graph regularized sparse coding model into the distance metric learning framework and propose our new bilevel model. The lower level focuses on detecting the underlying data structure, while the upper level directly forces the data samples from the same class to be close to each other and pushes those samples from different classes far away. Note that the input data samples of the upper level are represented by the sparse coefficients learnt from the lower level model. Benefiting from the feature extraction operation of the lower level model, the new features become more robust to noise with the sparsity norm and more discriminative with the Laplacian graph term. In addition, to solve our bilevel model, we transform the lower level problem of the proposed model into equality and inequality constraints and then apply ADM to solve it. Extensive experiments on various occluded datasets indicate that the proposed bilevel model can achieve more promising performance than other related methods.\n\nNotations: Let S_+ denote the set of real-valued symmetric positive semi-definite (PSD) matrices. For matrices A and B, denote the Frobenius inner product by \u27e8A, B\u27e9 = Tr(A^\u22a4 B), where \u2018Tr\u2019 denotes the trace of a matrix. For a given vector a = (a_1, a_2, ..., a_d)^\u22a4, diag(a) = A corresponds to a square diagonal matrix such that \u2200i, A_{i,i} = a_i. e_k \u2208 R^k represents the all-ones vector of length k, and I is the identity matrix. 
Finally, for x \u2208 R, let [x]_+ = max(0, x).\n\n2 Bilevel Distance Metric Learning\n\n2.1 Large Margin Nearest Neighbor\n\nLet {(x_1, y_1), ..., (x_n, y_n)} \u2208 R^d \u00d7 C be a set of labeled training data with discrete labels C = {1, ..., c}, where n is the number of samples. Most metric learning methods aim to learn a metric, such as the widely used Mahalanobis distance d_M(x_i, x_j) = \u221a((x_i \u2212 x_j)^\u22a4 M (x_i \u2212 x_j)), to effectively reflect the similarity between data.\n\nLarge margin nearest neighbor (LMNN) [16], one of the most widely used metric learning methods, requires the learned Mahalanobis distance to satisfy two objectives, i.e., samples from the same class are forced to be close to each other and those from different classes are pushed far away. If we denote the similar pairs by S and the triplet constraints by T as:\n\nS = {(x_i, x_j) : y_i = y_j and x_j belongs to the k-neighborhood of x_i},\nT = {(x_i, x_j, x_k) : (x_i, x_j) \u2208 S, y_i \u2260 y_k},    (1)\n\nthen the LMNN model can be formulated as:\n\nmin_{M \u2208 S_+} (1 \u2212 \u03bb) \u2211_{(i,j) \u2208 S} d_M^2(x_i, x_j) + \u03bb \u2211_{(i,j,k) \u2208 T} [1 + d_M^2(x_i, x_j) \u2212 d_M^2(x_i, x_k)]_+,    (2)\n\nwhere \u03bb \u2208 [0, 1] controls the relative weight between the two terms. The objective function in Eq. (2) pulls the \u201ctarget\u201d neighbors whose labels are the same as x_i\u2019s toward x_i and pushes away the \u201cimpostor\u201d neighbors whose labels are different from x_i\u2019s.\n\nFigure 1: Original data X and the corresponding sparse coefficients A learned from the proposed bilevel distance metric learning model. The x-axis represents different samples belonging to eight classes. Panels: (a) Original data X, (b) Sparse coefficients A.\n\nAlthough LMNN achieves good results, it learns the distance matrix to characterize the point-to-point distance, which is sensitive to noise. 
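As a minimal sketch of the LMNN objective in Eq. (2), the snippet below evaluates the pull and push terms for a fixed M on three toy points. The function name lmnn_objective, the row-wise data layout and the toy pair and triplet lists are illustrative assumptions and are not taken from the original method description.

```python
import numpy as np

def lmnn_objective(X, M, pairs, triplets, lam=0.5):
    # X holds one sample per row; M is a PSD matrix of matching size.
    def d2(i, j):
        diff = X[i] - X[j]
        return float(diff @ M @ diff)
    # Pull term: squared distances over the similar pairs (the set S).
    pull = sum(d2(i, j) for i, j in pairs)
    # Push term: hinge loss with unit margin over the triplets (the set T).
    push = sum(max(0.0, 1.0 + d2(i, j) - d2(i, k)) for i, j, k in triplets)
    return (1.0 - lam) * pull + lam * push

# Sample 1 is the target neighbor of sample 0; sample 2 is the impostor.
X_far = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
X_near = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
M = np.eye(2)
loss_far = lmnn_objective(X_far, M, [(0, 1)], [(0, 1, 2)])
loss_near = lmnn_objective(X_near, M, [(0, 1)], [(0, 1, 2)])
```

With the impostor far away the hinge term vanishes and only the pull term remains; moving the impostor inside the margin activates the push term and increases the loss.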
Furthermore, the descriptors of the sample pairs it addresses are usually extracted directly from the original data in the preprocessing phase without considering the local geometrical structure of the data. Thus such features may not be optimal for the subsequent metric learning task. To address these problems, in this paper, we propose a bilevel model which jointly learns the distance matrix M and extracts features under a sparse representation-based framework.\n\n2.2 Bilevel Distance Metric Learning\n\nSparse representations prove to be an effective feature for classification. Also, some researchers suggest that the contribution of one sample to the reconstruction of another sample is a good indicator of similarity between these two samples [3]. Thus the reconstruction coefficients can be used to constitute the similarity graph. Inspired by these findings, we integrate both sparse representation and graph regularization into a metric learning framework and propose our new bilevel distance metric learning model.\n\nWe assume all the data samples X = [x_1, x_2, ..., x_n] \u2208 R^{d\u00d7n} are represented by their corresponding sparse coefficients A = [a_1, a_2, ..., a_n] \u2208 R^{k\u00d7n} based on a learned dictionary U = [u_1, u_2, ..., u_k] \u2208 R^{d\u00d7k}. Then the proposed bilevel distance metric learning model can be expressed as follows:\n\nmin_{M \u2208 S_+^d, U} (1 \u2212 \u03bb) \u2211_{(i,j) \u2208 S} d_M^2(a_i, a_j) + \u03bb \u2211_{(i,j,k) \u2208 T} [\u03be + d_M^2(a_i, a_j) \u2212 d_M^2(a_i, a_k)]_+\ns.t. A = arg min_A (1/2)||X \u2212 UA||_F^2 + \u03b1||A||_1 + (\u03b2/2) Tr(A L A^\u22a4), ||u_i||_2^2 \u2264 1, \u2200i,    (3)\n\nwhere the Laplacian term Tr(A L A^\u22a4) is introduced to guarantee that the sparse coefficients capture the geometric structure of the data. L is the graph Laplacian matrix constructed from the label vector Y = [y_1, y_2, ..., y_n] \u2208 R^n. 
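A small numerical sketch may help with the Laplacian term. It assumes the supervised construction given later in Section 2.4, namely weights Q_ij = 1 for samples of the same class and L = T - Q with T_ii equal to the i-th row sum of Q. Under that assumption, Tr(A L A^T) equals half the weighted sum of squared distances between same-class codes. The helper names and the toy codes are hypothetical.

```python
import numpy as np

def label_laplacian(y):
    # Q_ij = 1 when samples i and j share a label, 0 otherwise.
    y = np.asarray(y)
    Q = (y[:, None] == y[None, :]).astype(float)
    # T is diagonal with the row sums of Q, and L = T - Q.
    T = np.diag(Q.sum(axis=1))
    return T - Q, Q

def laplacian_penalty(A, L):
    # A stores one code per column, as in the paper.
    return float(np.trace(A @ L @ A.T))

y = [0, 0, 1]
L, Q = label_laplacian(y)
A = np.array([[1.0, 3.0, 0.0],
              [0.0, 0.0, 2.0]])
penalty = laplacian_penalty(A, L)
# Equivalent pairwise form: half the Q-weighted squared code distances.
pairwise = 0.5 * sum(Q[i, j] * np.sum((A[:, i] - A[:, j]) ** 2)
                     for i in range(3) for j in range(3))
```

Only the two same-class codes contribute here, so a smaller penalty means that codes of the same class lie closer together.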
\u03bb, \u03b1 and \u03b2 are three regularization parameters.\n\nIn our bilevel model (3), the upper level feeds the representation (a_i, a_j, a_k) of the triplet constraint (x_i, x_j, x_k) into the LMNN model and directly minimizes the loss function. The lower level tries to capture the intrinsic data structure. Note that the Laplacian matrix L is constructed in a supervised way, thus the data structure can be well preserved even if there exists noise in the data. By solving the above optimization problem (3), a recognition-driven dictionary U can be learnt, which in turn leads to well-representative sparse coefficients A. In the meantime, we can also obtain a good Mahalanobis distance matrix M with the new discriminative feature A.\n\nIt is worth mentioning that the sparsity penalty and Laplacian regularization encourage the group sparsity of coefficients, thus the samples from the same class are forced to have similar sparse representations and those from different classes are to have dissimilar sparse codes. For clarity, we show the original data (including eight classes) and its corresponding sparse coefficients learnt by our bilevel model in Fig. 1. The coefficients equipped with this useful property make it easier for the upper level to fulfill its mission, which is to force the data samples from the same class to be close to each other and push those samples from different classes far away.\n\n2.3 Optimization\n\nWe use the alternating direction method (ADM) to solve the optimization problem (3) after some delicate reformulations.\n\nLet A = B \u2212 C, where B \u2208 R^{k\u00d7n} and C \u2208 R^{k\u00d7n} are two nonnegative matrices such that B takes all the positive elements in A and the remaining elements of B are set to 0, while C does the same for the negative elements in A (after negation). 
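The sign split can be checked numerically. The sketch below builds B and C from a toy matrix A, stacks them into Z = [B; C] and forms P = [I, -I] as in the reformulation of the lower level problem, then confirms that PZ recovers A and that the sum of the entries of Z equals the l1 norm of A. The function name split_signs and the toy values are assumptions made for illustration.

```python
import numpy as np

def split_signs(A):
    # B keeps the positive entries of A; C keeps the negated negative entries.
    B = np.maximum(A, 0.0)
    C = np.maximum(-A, 0.0)
    return B, C

A = np.array([[1.5, -2.0],
              [0.0, 0.5]])
B, C = split_signs(A)
k = A.shape[0]
P = np.hstack([np.eye(k), -np.eye(k)])  # P = [I, -I], size k x 2k
Z = np.vstack([B, C])                   # Z = [B; C], size 2k x n
reconstructed = P @ Z                   # equals B - C
l1_from_Z = float(Z.sum())              # sum of all entries of the nonnegative Z
```

Because Z is nonnegative, the nonsmooth l1 penalty on A becomes a linear function of Z, which is what allows the lower level problem to be written with a simple constraint Z >= 0.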
Then the lower level optimization problem of model (3) can be transformed into the following problem:\n\nmin_Z (1/2)||X \u2212 UPZ||_F^2 + \u03b1 e_{2k}^\u22a4 Z e_n + (\u03b2/2) Tr(P Z L Z^\u22a4 P^\u22a4), s.t. Z \u2265 0,    (4)\n\nwhere Z = [B; C] \u2208 R^{2k\u00d7n} and P = [I, \u2212I] \u2208 R^{k\u00d72k}. Obviously, problem (4) is a convex problem, which can be replaced by its KKT conditions [23]. Then we obtain the following equivalent model:\n\nmin_{M \u2208 S_+^d, Z, B, U} (1 \u2212 \u03bb) \u2211_{(i,j) \u2208 S} d_{P^\u22a4MP}^2(z_i, z_j) + \u03bb \u2211_{(i,j,k) \u2208 T} [\u03be + d_{P^\u22a4MP}^2(z_i, z_j) \u2212 d_{P^\u22a4MP}^2(z_i, z_k)]_+\ns.t. P^\u22a4U^\u22a4UPZ \u2212 P^\u22a4U^\u22a4X + \u03b1E + \u03b2P^\u22a4PZL + B = 0,\nB \u2299 Z = 0, Z \u2265 0, B \u2264 0,\n||u_i||_2^2 \u2264 1, \u2200i \u2208 {1, 2, ..., k},    (5)\n\nwhere B \u2208 R^{2k\u00d7n} is the Lagrange multiplier matrix and B satisfies the constraint B \u2264 0. E \u2208 R^{2k\u00d7n} is an all-one matrix.\n\nWith all these steps, the proposed bilevel distance metric learning model (3) is reformulated to a unilevel optimization problem which can be solved by ADM. We introduce two auxiliary variables W and S and relax (5) to the following problem:\n\nmin_{M \u2208 S_+^d, Z, B, W, S, U} (1 \u2212 \u03bb) \u2211_{(i,j) \u2208 S} d_{P^\u22a4MP}^2(z_i, z_j) + \u03bb \u2211_{(i,j,k) \u2208 T} [\u03be + d_{P^\u22a4MP}^2(z_i, z_j) \u2212 d_{P^\u22a4MP}^2(z_i, z_k)]_+\ns.t. P^\u22a4U^\u22a4UPZ \u2212 P^\u22a4U^\u22a4X + \u03b1E + \u03b2WL + B = 0, B \u2299 S = 0, P^\u22a4PZ \u2212 W = 0,\nZ \u2212 S = 0, S \u2265 0, B \u2264 0,\n||u_i||_2^2 \u2264 1, \u2200i \u2208 {1, 2, ..., k}.    (6)\n\nThe augmented Lagrangian function of problem (6) is:\n\nL(Z, B, W, S, U, M, R_1, R_2, R_3, R_4, \u00b5)\n= (1 \u2212 \u03bb) \u2211_{(i,j) \u2208 S} d_{P^\u22a4MP}^2(z_i, z_j) + \u03bb \u2211_{(i,j,k) \u2208 T} [\u03be + d_{P^\u22a4MP}^2(z_i, z_j) \u2212 d_{P^\u22a4MP}^2(z_i, z_k)]_+\n+ \u27e8R_1, P^\u22a4U^\u22a4UPZ \u2212 P^\u22a4U^\u22a4X + \u03b1E + \u03b2WL + B\u27e9 + \u27e8R_2, B \u2299 S\u27e9 + \u27e8R_3, P^\u22a4PZ \u2212 W\u27e9 + \u27e8R_4, Z \u2212 S\u27e9\n+ (\u00b5/2)||P^\u22a4U^\u22a4UPZ \u2212 P^\u22a4U^\u22a4X + \u03b1E + \u03b2WL + B||_F^2\n+ (\u00b5/2)(||B \u2299 S||_F^2 + ||P^\u22a4PZ \u2212 W||_F^2 + ||Z \u2212 S||_F^2),    (7)\n\nwhere R_1, ..., R_4 are Lagrange multipliers, and \u00b5 \u2265 0 is the penalty parameter.\n\nWe alternately update the variables Z, B, W, S, U and M in each iteration by minimizing the augmented Lagrangian function of problem (6) with other variables fixed. We initialize the Mahalanobis distance matrix M as a unit matrix. The initialization processes of the dictionary U and the coefficients A are the same as in FDDL [21]. More specifically, the iterations go as follows:\n\nStep 1: Update Z by fixing B, W, S, U and M. For each z_i \u2208 Z, we have\n\nz_i = G_1^{\u22121} (q_i + (1 \u2212 \u03bb) P^\u22a4MP \u2211_{(i,j) \u2208 S} z_j + \u03bb P^\u22a4MP \u2211_{(i,j,k) \u2208 T} (z_j \u2212 z_k)),    (8)\n\nwhere G_1 = 2\u00b5P^\u22a4U^\u22a4UU^\u22a4UP + 2\u00b5P^\u22a4P + \u00b5I + (1 \u2212 \u03bb) \u2211_{(i,j) \u2208 S} P^\u22a4MP. 
q_i is the i-th column of Q, with Q = \u00b5P^\u22a4U^\u22a4UP(P^\u22a4U^\u22a4X \u2212 \u03b1E \u2212 \u03b2WL \u2212 B \u2212 R_1/\u00b5) \u2212 P^\u22a4P(R_3 \u2212 \u00b5W) \u2212 R_4 + \u00b5S.\n\nStep 2: Update B by fixing Z, W, S, U and M.\n\nB = \u2212\u03a0_+((S \u2299 R_2/\u00b5 + G_2 + \u03b2WL + R_1/\u00b5) \u2298 (S \u2299 S + E)),    (9)\n\nwhere G_2 = P^\u22a4U^\u22a4UPZ \u2212 P^\u22a4U^\u22a4X + \u03b1E. \u03a0_+(\u00b7) is an operator that projects a matrix onto the nonnegative cone, which can be defined as follows:\n\n\u03a0_+(X_{ij}) = X_{ij} if X_{ij} \u2265 0, and 0 otherwise.    (10)\n\nStep 3: Update W by fixing Z, B, S, U and M.\n\nW = [P^\u22a4PZ + R_3/\u00b5 \u2212 \u03b2(G_2 + B + R_1/\u00b5)L^\u22a4](\u03b2^2 LL^\u22a4 + I)^{\u22121}.    (11)\n\nStep 4: Update S by fixing Z, B, W, U and M.\n\nS = \u03a0_+((Z + R_4/\u00b5 \u2212 B \u2299 R_2/\u00b5) \u2298 (B \u2299 B + E)).    (12)\n\nStep 5: Update U by fixing Z, B, W, S and M. We need to solve the following problem:\n\nU = arg min_{U \u2208 \u2126} ||G_2 + \u03b2WL + B + R_1/\u00b5||_F^2,    (13)\n\nwhere \u2126 = {U | ||u_i||_2^2 \u2264 1, i = 1, ..., k}. Problem (13) is a quartic polynomial minimization problem, and it is difficult to compute its exact solution. So we use the projected gradient descent method to update U:\n\nU = \u03a0_\u2126(U \u2212 \u03b7_1 \u2207U),    (14)\n\nwith \u2207U = 2(U(G_4^\u22a4 + G_4) + U(G_5^\u22a4 + G_5)) \u2212 4(U(G_6^\u22a4 + G_6) + G_7^\u22a4) + 2U(G_8^\u22a4 + G_8) \u2212 2XG_3^\u22a4P^\u22a4 + 4XX^\u22a4U, where G_3 = \u03b1E + \u03b2WL + B + R_1/\u00b5, G_4 = PZZ^\u22a4P^\u22a4U^\u22a4U, G_5 = U^\u22a4UPZZ^\u22a4P^\u22a4, G_6 = PZX^\u22a4U, G_7 = U^\u22a4UPZX^\u22a4, G_8 = PZG_3^\u22a4P^\u22a4. \u03a0_\u2126(U) is the projection of the matrix U onto \u2126 and \u03b7_1 is a step size.\n\nStep 6: Update M by fixing Z, B, W, S and U. The objective function is linear with respect to M, so we directly adopt subgradient descent to update M in each iteration. Set z_{ij} = z_i \u2212 z_j; then the subgradient of problem (6) with respect to M can be calculated as follows:\n\n\u2207M = (1 \u2212 \u03bb) \u2211_{(i,j) \u2208 S} P z_{ij} z_{ij}^\u22a4 P^\u22a4 + \u03bb \u2211_{(i,j,k) \u2208 T^+} (P z_{ij} z_{ij}^\u22a4 P^\u22a4 \u2212 P z_{ik} z_{ik}^\u22a4 P^\u22a4),    (15)\n\nwhere T^+ denotes the subset of triplets in T whose hinge term in (6) is larger than 0. After each iteration, M is projected onto the positive semidefinite cone:\n\nM = \u03a0_{S_+^d}(M \u2212 \u03b7_2 \u2207M),    (16)\n\nwhere \u03b7_2 is a step size, and \u03a0_{S_+^d}(M) is the orthogonal projection of the matrix M \u2208 S^d onto the positive semidefinite cone S_+^d. The specific procedures are summarized in Algorithm 1.\n\nAlgorithm 1 Algorithm to solve Eq. (6)\n1: Input: S, T, X \u2208 R^{d\u00d7n}, L, \u03bb, \u03b1, \u03b2\n2: Output: M \u2208 S_+^d, U, A\n3: Initialization: M^0, U^0, A^0, Z^0 = P^\u2020 A^0, S^0 = Z^0, W^0 = P^\u22a4PZ^0, and B^0 = P^\u22a4(U^0)^\u22a4X \u2212 P^\u22a4(U^0)^\u22a4U^0PZ^0 \u2212 \u03b1E \u2212 \u03b2W^0L. Set R_1 = 0_d, R_2 = 0_d, R_3 = 0_d, R_4 = 0_d, \u00b5^0 = 1e\u22123, \u00b5_max = 1e+8, \u03c1 = 1.3, \u03b5_1 = 1e\u22124, \u03b5_2 = 1e\u22125, and t = 0.\n4: repeat\n5:   Steps 1\u223c6;\n6:   Update Lagrange multipliers and \u00b5^{t+1}:\n     R_1^{t+1} = R_1^t + \u00b5(P^\u22a4(U^{t+1})^\u22a4U^{t+1}PZ^{t+1} \u2212 P^\u22a4(U^{t+1})^\u22a4X + \u03b1E + \u03b2W^{t+1}L + B^{t+1}),\n     R_2^{t+1} = R_2^t + \u00b5(B^{t+1} \u2299 S^{t+1}), R_3^{t+1} = R_3^t + \u00b5(P^\u22a4PZ^{t+1} \u2212 W^{t+1}),\n     R_4^{t+1} = R_4^t + \u00b5(Z^{t+1} \u2212 S^{t+1}), \u00b5^{t+1} = min(\u03c1\u00b5^t, \u00b5_max);\n7:   t \u2190 t + 1;\n8: until convergence\n\n2.4 Classification Scheme\n\nWhen problem (6) is solved, we obtain a dictionary U and the sparse coefficients A = PZ of training samples. In the testing phase, given a testing sample x, we first compute its sparse coefficient by the vector form of the lower level optimization model:\n\na^\u2217 = arg min_a (1/2)||x \u2212 Ua||_F^2 + \u03b1||a||_1 + (\u03b2/2) \u2211_{i \u2208 N_s(x)} q_i ||a \u2212 a_i||_2^2,    (17)\n\nwhere N_s(x) denotes the set of s nearest neighbors of x and the s nearest neighbors are chosen from training samples X. a_i is the coefficient of the i-th training sample x_i. q_i is the weight between the training sample x_i and the test sample x. Note that in the training phase, we construct the weight matrix Q as follows:\n\nQ_{ij} = 1 if samples x_i and x_j belong to the same class, and 0 otherwise.    (18)\n\nThen we compute the corresponding Laplacian matrix L = T \u2212 Q, where T is a diagonal matrix and T_{ii} = \u2211_j Q_{ij}. In the testing phase, we find the s nearest neighbors from the training set for each test sample. In the experiment, we set s = 5 and the weight q_i = 1 (\u2200i \u2208 N_s(\u00b7)).\n\nAfter the coefficient a^\u2217 of the test sample x is obtained, the squared Mahalanobis distance between the test sample x and the training sample x_i can be calculated as:\n\nd_M^2(a^\u2217, a_i) = (a^\u2217 \u2212 a_i)^\u22a4 M (a^\u2217 \u2212 a_i),    (19)\n\nwhere M is the learned optimal distance matrix. The test sample x is then classified to the class where its nearest training sample belongs.\n\n2.5 Convergence Analysis\n\nMany researchers have studied the convergence of ADM with two blocks of variables. However, there is still no affirmative convergence proof for the multi-block convex minimization problem where the objective function consists of more than two separable convex functions. 
The recent solution is to use an additional dual step-size parameter \u00b5 in updating Lagrange multipliers (as shown in Algorithm 1). This scheme is simple and effective because it not only requires no additional assumptions associated with the objective function but also guarantees the convergence of ADM with multi-block variables under mild assumptions [9]. For these reasons, we only need to choose a proper step-size parameter \u00b5 and termination conditions.\n\nTo further illustrate the convergence of ADM in solving the proposed model (6), we conduct several experiments on three datasets, including the NUST Robust Face database (NUST-RF) [2], the OSR dataset [13] and the PubFig database [6]. Note that there are two environments in the NUST-RF database, i.e., indoor and outdoor. The objective function values versus the number of iterations are shown in Fig. 2. From the figure we can see that the objective values decrease reasonably well.\n\nFigure 2: Objective value vs. the number of iterations. Panels: (a) Indoor, (b) Outdoor, (c) OSR, (d) PubFig.\n\n3 Experimental Results\n\nWe evaluate the proposed algorithm over different classification databases, including real-world malicious occlusion datasets, contiguous occlusion and corruption datasets. There are two main goals in our experiments: first, we will show that our bilevel model is more robust when applied to real-world occlusion problems; second, our model is able to outperform the related metric learning methods.\n\nFigure 3: Cropped images of one subject captured in two environments in NUST-RF database, i.e., (a) indoor, and (b) outdoor.\n\n3.1 Compared Methods\n\nWe compare the proposed bilevel distance metric learning model with the following methods: the baseline KNN [14], LMNN [16], FANTOPE [7], CAP [5] and RML [10]. 
Specifically, as the baseline, we consider the most relevant technique from the literature, i.e., the k-nearest neighbor method (KNN). KNN computes the Euclidean distance to measure the similarity between any two images. LMNN is one of the most widely-used Mahalanobis distance metric learning methods, which uses labeled information to generate triplet constraints. The FANTOPE method is based on LMNN, and it utilizes a fantope regularization which minimizes the sum of the k smallest singular values of the distance matrix M. Like FANTOPE, the CAP method is also based on LMNN, and it uses a capped trace norm to penalize the singular values of the distance matrix M that are less than a threshold adaptively learned in the optimization. RML learns the discriminative distance matrix by enforcing a margin between the inter-class sparse reconstruction residual and the intra-class sparse reconstruction residual.\n\nFor all metric learners, we use 5-fold cross validation and report the average accuracy and standard deviation as the final performance. All the regularization parameters are tuned from the range {10^{\u22124}, 10^{\u22123}, 10^{\u22122}, 10^{\u22121}, 1, 10, 10^2}. For the CAP and FANTOPE methods, the rank parameter of the distance matrix M is tuned from [10 : 5 : 30]. For a fair comparison, we specify 1 \u201ctarget\u201d neighbor for each training sample for all LMNN related methods. In the testing phase, we use the 1-NN method.\n\n3.2 Real-World Malicious Occlusion\n\nFirst we consider the NUST Robust Face database (NUST-RF) [2]. It is mainly designed for robust face recognition under various occlusions. Besides occlusion, it also includes variations of illumination, expression and pose. We use a subset of the face images of the NUST-RF database, in which there are 50 subjects captured in two environments (indoor and outdoor). We manually crop the face portion of the image and then normalize it to 80 \u00d7 60 pixels. Fig. 3 shows an example of several selected images of one subject. 
We extracted LOMO features for each image [8], which not only achieve some invariance to viewpoint changes, but also capture local region characteristics of a person. PCA is further applied to reduce the feature dimension to 30.\n\nTable 1: Recognition accuracy (%) and standard deviation of different methods on NUST-RF database in two environments.\n\nEnvironment | KNN [14] | LMNN [16] | FANTOPE [7] | CAP [5] | RML [10] | Proposed\nIndoor | 36.14 \u00b1 2.70 | 36.20 \u00b1 3.30 | 41.87 \u00b1 2.50 | 41.70 \u00b1 2.86 | 35.56 \u00b1 3.04 | 47.84 \u00b1 1.18\nOutdoor | 45.24 \u00b1 1.51 | 46.01 \u00b1 2.06 | 58.72 \u00b1 1.33 | 58.34 \u00b1 1.33 | 42.81 \u00b1 2.16 | 58.21 \u00b1 1.68\n\nFigure 4: Example pairs of images from two datasets, i.e., (a) OSR dataset, (b) PubFig dataset. For each subfigure, from left to right: original image, its noisy versions (with sparse noise, regular occlusion and irregular occlusion, respectively).\n\nTable 1 shows the recognition performance of different methods on the NUST-RF database in the two environments. Our method outperforms the other competing methods in the indoor case and gets comparable results with the FANTOPE method in the outdoor case. This is because the proposed model jointly extracts features under a sparse-representation model and performs the distance metric learning task at the same time. In this way, our features are more robust to noise, thus we can get better results than LMNN, which is only based on extracted features. The FANTOPE and CAP methods also achieve relatively good results because the low-rank regularization on M fits this face database just right. Moreover, as shown in Fig. 
3, if occlusions exist, it is unlikely that the test image will be very close to any single training image of the same class, so that the KNN classifier performs poorly. Although LMNN can improve the recognition rates compared to KNN, its improvement is limited. RML also performs poorly because it is based on the MSE criterion, which is sensitive to outliers.\n\n3.3 Sparse Noise and Contiguous Occlusion\n\nNext we ran three groups of occlusion experiments on two datasets, i.e., the OSR dataset [13] and the PubFig database [6], to validate the robustness of the proposed algorithm. There are 2688 images from 8 scene categories in the OSR dataset. We extract gist features as the representation [12]. For the PubFig database, we use a subset of the face images with 771 images from 8 face categories [6]. As with the NUST-RF database, we extract LOMO features as the representation. We simulate various types of contiguous occlusion by adding sparse noise to both training and testing data or by replacing a randomly selected local region in each image with an unrelated square block of the \u201cbaboon\u201d image for regular occlusion and a randomly located \u201ctiger\u201d image for irregular occlusion. Sparse noise is simulated by 20 adding Gaussian noise with zero mean and 0.01 variance to both training and testing data. The size of the added image is 60% of the size of the original image. Fig. 4 shows a clean image and its noisy versions from two datasets. Since the differences between the pixels of the unrelated \u201cbaboon\u201d image or \u201ctiger\u201d image and the pixels of the images from two datasets are relatively small, the contiguous occlusion caused by these unrelated images is much more challenging than by random black or white dots.\n\nTable 2 and Table 3 show the classification accuracy and the standard deviation of different methods on two datasets, i.e., OSR dataset and PubFig dataset. 
It is obvious that our method outperforms the other competing methods in most cases, especially on the occlusion data. This is because our bilevel model jointly performs metric learning and extracts features at the same time. Since we use the sparsity penalty and graph regularization in the lower level model, the new features are not only more robust to noise but also discriminative. For this classification task, both FANTOPE and CAP methods are based on the LMNN method. They all have similar results, which indicates that the low-rank regularization on M for Mahalanobis distance metric learning is not particularly effective in this case. Especially for regular occlusion that replaces a randomly selected local region with the \u201cbaboon\u201d image and irregular occlusion that replaces a local region with the \u201ctiger\u201d image, LMNN, FANTOPE and CAP achieve almost the same result. For the RML method, due to the limitation that RML is based on the MSE criterion, it still performs poorly.\n\nTable 2: Recognition accuracy (%) and standard deviation of different methods on OSR dataset, where sparse noise, regular and irregular occlusions are added.\n\nSetting | KNN [14] | LMNN [16] | FANTOPE [7] | CAP [5] | RML [10] | Proposed\nOriginal | 69.01 \u00b1 1.96 | 74.41 \u00b1 1.20 | 74.97 \u00b1 0.88 | 74.45 \u00b1 1.19 | 61.34 \u00b1 1.62 | 74.43 \u00b1 1.14\nSparse | 61.83 \u00b1 1.75 | 66.67 \u00b1 1.70 | 66.70 \u00b1 1.68 | 66.67 \u00b1 1.70 | 56.57 \u00b1 2.60 | 68.72 \u00b1 2.72\nRegular | 55.34 \u00b1 2.72 | 58.66 \u00b1 1.31 | 58.73 \u00b1 1.43 | 58.70 \u00b1 1.27 | 54.38 \u00b1 3.26 | 64.77 \u00b1 1.46\nIrregular | 52.25 \u00b1 1.74 | 57.02 \u00b1 1.80 | 57.10 \u00b1 1.74 | 57.13 \u00b1 1.68 | 50.45 \u00b1 2.17 | 62.72 \u00b1 3.06\n\nTable 3: Recognition accuracy (%) and standard deviation of different methods on PubFig dataset, where sparse noise, regular and irregular occlusions are added.\n\nSetting | KNN [14] | LMNN [16] | FANTOPE [7] | CAP [5] | RML [10] | Proposed\nOriginal | 56.73 \u00b1 1.12 | 61.65 \u00b1 1.63 | 61.69 \u00b1 1.60 | 61.80 \u00b1 1.72 | 55.86 \u00b1 1.54 | 63.46 \u00b1 1.65\nSparse | 48.46 \u00b1 1.35 | 51.35 \u00b1 1.55 | 51.39 \u00b1 1.69 | 51.39 \u00b1 1.99 | 48.23 \u00b1 1.26 | 54.40 \u00b1 1.59\nRegular | 35.30 \u00b1 1.14 | 37.48 \u00b1 1.64 | 37.71 \u00b1 1.56 | 38.05 \u00b1 1.57 | 35.80 \u00b1 1.59 | 44.29 \u00b1 0.54\nIrregular | 40.94 \u00b1 2.30 | 41.73 \u00b1 3.39 | 41.88 \u00b1 2.78 | 42.33 \u00b1 2.35 | 40.23 \u00b1 2.48 | 49.10 \u00b1 1.09\n\nTo discuss the influences of individual parameters on the performance of the proposed model, we take the PubFig dataset with regular occlusion as an example. We test the influence of the parameters \u03bb, \u03b1, \u03b2 on the recognition accuracy, as shown in Fig. 5.\n\nFigure 5: The influence of parameters \u03bb, \u03b1, \u03b2 on the recognition accuracy of PubFig dataset with regular occlusion.\n\n4 Conclusion\n\nWe propose a new bilevel distance metric learning model for the robust image recognition task. Different from conventional metric learning methods, which learn a Mahalanobis distance matrix based on extracted features, we uncover the intrinsic data structures using the Laplacian graph regularized sparse coefficients and jointly perform distance metric learning at the same time. Due to the feature extraction operation of the lower level model, the new descriptors become more robust to noise with the sparsity norm and more discriminative with the Laplacian graph term, leading to good recognition performance. Moreover, we also derive an efficient algorithm to solve the proposed new model. Extensive experiments on several occluded datasets verify the remarkable performance improvements brought by the proposed bilevel model.\n\nAcknowledgments\n\nJ.X. and C.D. 
were partially supported by the National Natural Science Foundation of China 61572388,\nthe National Key Research and Development Program of China (2017YFE0104100), and the Key\nR&D Program-The Key Industry Innovation Chain of Shaanxi under Grants 2017ZDCXL-GY-05-04-02\nand 2018ZDXM-GY-176.\nL.L. and H.H. were partially supported by U.S. NSF-IIS 1836945, NSF-IIS 1836938, NSF-DBI\n1836866, NSF-IIS 1845666, NSF-IIS 1852606, NSF-IIS 1838627, NSF-IIS 1837956.\n\nReferences\n[1] Herv\u00e9 Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews:\n\ncomputational statistics, 2(4):433\u2013459, 2010.\n\n[2] Shuo Chen, Jian Yang, Lei Luo, Yang Wei, Kaihua Zhang, and Ying Tai. Low-rank latent pattern\napproximation with applications to robust image classi\ufb01cation. IEEE transactions on image processing,\n26(11):5519\u20135530, 2017.\n\n[3] Bin Cheng, Jianchao Yang, Shuicheng Yan, Yun Fu, and Thomas S Huang. Learning with l1-graph for\n\nimage analysis. IEEE transactions on image processing, 19(4):858\u2013866, 2010.\n\n[4] Jason V Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S Dhillon. Information-theoretic metric\n\nlearning. In ICML, pages 209\u2013216. ACM, 2007.\n\n[5] Zhouyuan Huo, Feiping Nie, and Heng Huang. Robust and effective metric learning using capped trace\n\nnorm: Metric learning via capped trace norm. In SIGKDD, pages 1605\u20131614. ACM, 2016.\n\n[6] Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and simile classi\ufb01ers\n\nfor face veri\ufb01cation. In ICCV, pages 365\u2013372. IEEE, 2009.\n\n[7] Marc T Law, Nicolas Thome, and Matthieu Cord. Fantope regularization in metric learning. In CVPR,\n\npages 1051\u20131058. 
IEEE, 2014.\n\n[8] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. Person re-identi\ufb01cation by local maximal occurrence\n\nrepresentation and metric learning. In CVPR, pages 2197\u20132206. IEEE, 2015.\n\n[9] Zhouchen Lin, Risheng Liu, and Zhixun Su. Linearized alternating direction method with adaptive penalty\n\nfor low-rank representation. In NIPS, pages 612\u2013620, 2011.\n\n[10] Jiwen Lu, Gang Wang, Weihong Deng, and Kui Jia. Reconstruction-based metric learning for unconstrained\n\nface veri\ufb01cation. IEEE transactions on information forensics and security, 10(1):79\u201389, 2015.\n\n[11] Lei Luo and Heng Huang. Matrix variate gaussian mixture distribution steered robust metric learning. In\n\nAAAI, pages 3722\u20133729, 2018.\n\n[12] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial\n\nenvelope. International journal of computer vision, 42(3):145\u2013175, 2001.\n\n[13] Devi Parikh and Kristen Grauman. Relative attributes. In ICCV, pages 503\u2013510. IEEE, 2011.\n\n[14] Leif E Peterson. K-nearest neighbor. Scholarpedia, 4(2):1883, 2009.\n\n[15] Zhangyang Wang, Yingzhen Yang, Shiyu Chang, Jinyan Li, Simon Fong, and Thomas S Huang. A joint\noptimization framework of sparse coding and discriminative clustering. In IJCAI, pages 3932\u20133938, 2015.\n\n[16] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor\n\nclassi\ufb01cation. JMLR, 10(Feb):207\u2013244, 2009.\n\n[17] John Wright, Allen Y Yang, Arvind Ganesh, S Shankar Sastry, and Yi Ma. Robust face recognition via\n\nsparse representation. TPAMI, 31(2):210\u2013227, 2009.\n\n[18] Jie Xu, Lei Luo, Cheng Deng, and Heng Huang. Robust metric learning model using maximum correntropy\n\ncriterion. In SIGKDD, pages 2555\u20132564. ACM, 2018.\n\n[19] Jie Xu, Lei Luo, and Heng Huang. Multi-level metric learning via smoothed wasserstein distance. 
In\n\nIJCAI, pages 2919\u20132925, 2018.\n\n[20] Jianchao Yang, Zhaowen Wang, Zhe Lin, Xianbiao Shu, and Thomas Huang. Bilevel sparse coding for\n\ncoupled feature spaces. In CVPR, pages 2360\u20132367. IEEE, 2012.\n\n[21] Meng Yang, Lei Zhang, Xiangchu Feng, and David Zhang. Fisher discrimination dictionary learning for\n\nsparse representation. In ICCV, pages 543\u2013550. IEEE, 2011.\n\n[22] Miao Zheng, Jiajun Bu, Chun Chen, Can Wang, Lijun Zhang, Guang Qiu, and Deng Cai. Graph regularized\nsparse coding for image representation. IEEE transactions on image processing, 20(5):1327\u20131336, 2011.\n\n[23] Pan Zhou, Chao Zhang, and Zhouchen Lin. Bilevel model-based discriminative dictionary learning for\n\nrecognition. IEEE transactions on image processing, 26(3):1173\u20131187, 2017.", "award": [], "sourceid": 2068, "authors": [{"given_name": "Jie", "family_name": "Xu", "institution": "Xidian University"}, {"given_name": "Lei", "family_name": "Luo", "institution": "University of Pittsburgh"}, {"given_name": "Cheng", "family_name": "Deng", "institution": "Xidian University"}, {"given_name": "Heng", "family_name": "Huang", "institution": "University of Pittsburgh"}]}