{"title": "Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 576, "page_last": 584, "abstract": "We consider feature selection and weighting for nearest neighbor classifiers. A technical challenge in this scenario is how to cope with the discrete update of nearest neighbors when the feature space metric is changed during the learning process. This issue, called the target neighbor change, was not properly addressed in the existing feature weighting and metric learning literature. In this paper, we propose a novel feature weighting algorithm that can exactly and efficiently keep track of the correct target neighbors via sequential quadratic programming. To the best of our knowledge, this is the first algorithm that guarantees the consistency between target neighbors and the feature space metric. We further show that the proposed algorithm can be naturally combined with regularization path tracking, allowing computationally efficient selection of the regularization parameter. We demonstrate the effectiveness of the proposed algorithm through experiments.", "full_text": "Target Neighbor Consistent Feature Weighting\n\nfor Nearest Neighbor Classi\ufb01cation\n\nIchiro Takeuchi\n\nDepartment of Engineering\n\nNagoya Institute of Technology\n\ntakeuchi.ichiro@nitech.ac.jp\n\nAbstract\n\nMasashi Sugiyama\n\nDepartment of Computer Science\n\nTokyo Institute of Technology\nsugi@cs.titech.ac.jp\n\nWe consider feature selection and weighting for nearest neighbor classi\ufb01ers. A\ntechnical challenge in this scenario is how to cope with discrete update of nearest\nneighbors when the feature space metric is changed during the learning process.\nThis issue, called the target neighbor change, was not properly addressed in the\nexisting feature weighting and metric learning literature. 
In this paper, we propose a novel feature weighting algorithm that can exactly and efficiently keep track of the correct target neighbors via sequential quadratic programming. To the best of our knowledge, this is the first algorithm that guarantees the consistency between target neighbors and the feature space metric. We further show that the proposed algorithm can be naturally combined with regularization path tracking, allowing computationally efficient selection of the regularization parameter. We demonstrate the effectiveness of the proposed algorithm through experiments.

1 Introduction

Nearest neighbor (NN) classifiers are among the most classical and perhaps the simplest non-linear classification algorithms. Nevertheless, they have recently gathered considerable attention again, since they have been demonstrated to be highly useful in state-of-the-art real-world applications [1, 2]. For further enhancing the accuracy and interpretability of NN classifiers, feature extraction and feature selection are highly important. Feature extraction for NN classifiers has been addressed under the name of metric learning [3–6], while feature selection for NN classifiers has been studied under the name of feature weighting [7–11].
One of the fundamental approaches to feature extraction/selection for NN classifiers is to learn the feature metric/weights so that instance pairs in the same class ('must-link') are close and instance pairs in other classes ('cannot-link') are far apart [12, 13]. Although this approach tends to provide simple algorithms, it has no direct connection to the classification loss of NN classifiers, and thus its validity is not clear.
However, directly incorporating the NN classification loss involves a significant technical challenge called the target neighbor (TN) change.
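As a minimal numerical sketch of this phenomenon (the toy points and weights below are hypothetical, not the paper's running example), the identity of the nearest neighbor can flip when the feature weights change:

```python
import numpy as np

def weighted_sq_dist(xa, xb, w):
    # Weighted squared Euclidean distance: sum_j w_j * (xa_j - xb_j)^2.
    return float(np.dot(w, (xa - xb) ** 2))

# Toy 2-D data: which training point is nearest to the query depends on w.
query = np.array([0.0, 0.0])
a = np.array([1.0, 0.0])   # close to the query along feature 1
b = np.array([0.0, 0.9])   # close to the query along feature 2

def nearest(w):
    da, db = weighted_sq_dist(query, a, w), weighted_sq_dist(query, b, w)
    return "a" if da < db else "b"

print(nearest(np.array([0.5, 0.5])))  # uniform weights -> 'b'
print(nearest(np.array([0.1, 0.9])))  # feature 2 emphasized -> 'a'
```

Any loss defined through "the nearest instance" therefore has to re-identify neighbors each time the weights move, which is exactly the TN-change issue discussed next.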
To explain this, let us consider binary classification by a 3NN classifier (see Figure 1). Since the classification result is determined by the majority vote from the 3 nearest instances, the classification loss is defined using the distance to the 2nd nearest instance in each class (which is referred to as a TN; see Section 2 for details). However, since 'nearest' instances generally change when the feature metric/weights are updated, TNs must also be updated to be kept consistent with the learned feature metric/weights during the learning process.
Although the TN change is a fundamental requirement in feature extraction/selection for NN classifiers, existing methods did not handle this issue properly. For example, in a seminal feature weighting method called Relief [7, 8], the fixed TNs determined based on the uniform weights (i.e., the Euclidean distance) are used throughout the learning process. Thus, the TN-weight consistency is not guaranteed (large-margin metric learning [5] also suffers from the same drawback).

Figure 1: Illustration of target neighbors (TNs). Left: (a) the Euclidean feature space with w1 = w2 = 1/2; the horizontal feature 1 and the vertical feature 2 are regarded as equally important. Right: (b) a weighted feature space with w1 = 2/3 and w2 = 1/3; the horizontal feature 1 is regarded as more important than the vertical feature 2. An instance ⓪ in the middle is correctly classified in 3NN classification if the distance to the 2nd nearest instance in the same class (called the 2nd target hit and denoted by h_0^2) is smaller than the distance to the 2nd nearest instance in different classes (called the 2nd target miss and denoted by m_0^2). In the Euclidean feature space (a), the 2nd target hit/miss are given by (h_0^2, m_0^2) = (②, ⑥). Since d(x_0, x_2 | w) > d(x_0, x_6 | w), the instance ⓪ is misclassified. On the other hand, in the weighted feature space (b), the 2nd target hit/miss are given by (h_0^2, m_0^2) = (①, ⑤). Since d(x_0, x_1 | w) < d(x_0, x_5 | w), the instance ⓪ is correctly classified.

The Simba algorithm [9] is a maximum-margin feature weighting method which adaptively updates TNs in the online learning process. However, the TN-weight consistency is still not guaranteed in Simba. I-Relief [10, 11] is a feature weighting method which cleverly avoids the TN change problem by considering a stochastic variant of NN classifiers (neighborhood component analysis [4] also introduced a similar stochastic approximation). However, since the behavior of stochastic NN classifiers tends to be significantly different from that of the original ones, the obtained feature metric/weights are not necessarily useful for the original NN classifiers.
In this paper, we focus on the feature selection (i.e., feature weighting) scenario, and propose a novel method that can properly address the TN change problem. More specifically, we formulate feature weighting as a regularized empirical risk minimization problem, and develop an algorithm that exactly and efficiently keeps track of the correct TNs via sequential quadratic programming. To the best of our knowledge, this is the first algorithm that systematically handles TN changes and guarantees the TN-weight consistency. We further show that the proposed algorithm can be naturally combined with regularization path tracking [14], allowing computationally efficient selection of the regularization parameter. Finally, we demonstrate the effectiveness of the proposed algorithm through experiments.
Throughout the paper, the superscript ⊤ indicates the transpose of vectors or matrices. We use R and R_+ to denote the sets of real numbers and non-negative real numbers, respectively, while we use N_n := {1, ..., n} to denote the set of natural numbers.
The notations 0 and 1 indicate vectors or matrices with all elements 0 and 1, respectively. The number of elements in a set S is denoted by |S|.

2 Preliminaries

In this section, we formulate the problem of feature weighting for nearest neighbor (NN) classification, and explain the fundamental concept of target neighbor (TN) change.
Consider a classification problem from n training instances with ℓ features. Let x_i := [x_{i1} ... x_{iℓ}]^⊤ ∈ R^ℓ be the i-th training instance and y_i be the corresponding label. The squared Euclidean distance between two instances x_i and x_{i'} is Σ_{j∈N_ℓ} (x_{ij} − x_{i'j})^2, while the weighted squared Euclidean distance is written as

d(x_i, x_{i'} | w) := Σ_{j∈N_ℓ} w_j (x_{ij} − x_{i'j})^2 = ε_{i,i'}^⊤ w,   (1)

where w := [w_1 ... w_ℓ]^⊤ ∈ [0, 1]^ℓ is an ℓ-dimensional vector of non-negative weights and ε_{i,i'} := [(x_{i1} − x_{i'1})^2 ... (x_{iℓ} − x_{i'ℓ})^2]^⊤ ∈ R^ℓ, (i, i') ∈ N_n × N_n, is introduced for notational simplicity.
We develop a feature weighting algorithm within the framework of regularized empirical risk minimization, i.e., minimizing a linear combination of a loss term and a regularization term. In order to formulate the loss term for NN classification, let us introduce the notion of target neighbors (TNs):

Definition 1 (Target neighbors (TNs)) Define H_i := {h ∈ N_n | y_h = y_i, h ≠ i} and M_i := {m ∈ N_n | y_m ≠ y_i} for i ∈ N_n.
Given a weight vector w, an instance h ∈ H_i is said to be the κ-th target hit of an instance i if it is the κ-th nearest instance among H_i, and m ∈ M_i is said to be the λ-th target miss of an instance i if it is the λ-th nearest instance among M_i, where the distance between instances is measured by the weighted Euclidean distance (1). The κ-th target hit and λ-th target miss of an instance i ∈ N_n are denoted by h_i^κ and m_i^λ, respectively. Target hits and misses are collectively called target neighbors (TNs)^1.

Using TNs, the misclassification rate of a binary kNN classifier for odd k is formulated as

L_kNN(w) := n^{−1} Σ_{i∈N_n} I{d(x_i, x_{h_i^κ} | w) > d(x_i, x_{m_i^λ} | w)}  with κ = λ = (k + 1)/2,

where I(·) is the indicator function with I(z) = 1 if z is true and I(z) = 0 otherwise. For example, in binary 3NN classification, an instance is misclassified if and only if the distance to the 2nd target hit is larger than the distance to the 2nd target miss (see Figure 1). The misclassification cost of a multi-class problem can also be formulated by using TNs similarly, but we omit the details for the sake of simplicity.
Since the indicator function I(·) included in the loss function L_kNN(w) is hard to deal with directly, we introduce the nearest neighbor (NN) margin^2 as a surrogate:

Definition 2 (Nearest neighbor (NN) margin) Given a weight vector w, the (κ, λ)-neighbor margin is defined as d(x_i, x_{m_i^λ} | w) − d(x_i, x_{h_i^κ} | w) for i ∈ N_n, κ ∈ N_{|H_i|}, and λ ∈ N_{|M_i|}.

Based on the NN margin, our loss function is defined as

L(w) := n^{−1} Σ_{i∈N_n} ( d(x_i, x_{h_i^κ} | w) − d(x_i, x_{m_i^λ} | w) ).

By minimizing L(w), the average (κ, λ)-neighbor margin over all instances is maximized.
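The definitions above translate directly into code. The following is a minimal sketch (the helper names are my own, with κ and λ written `kappa` and `lam`) that computes the κ-th target hit, the λ-th target miss, and the loss L(w):

```python
import numpy as np

def weighted_sq_dist(xa, xb, w):
    # Weighted squared Euclidean distance (1): sum_j w_j * (xa_j - xb_j)^2.
    return float(np.dot(w, (xa - xb) ** 2))

def target_neighbors(X, y, i, w, kappa, lam):
    # kappa-th target hit: kappa-th nearest same-class instance (excluding i);
    # lam-th target miss:  lam-th nearest different-class instance.
    hits = [h for h in range(len(y)) if h != i and y[h] == y[i]]
    misses = [m for m in range(len(y)) if y[m] != y[i]]
    hits.sort(key=lambda h: weighted_sq_dist(X[i], X[h], w))
    misses.sort(key=lambda m: weighted_sq_dist(X[i], X[m], w))
    return hits[kappa - 1], misses[lam - 1]

def margin_loss(X, y, w, kappa=1, lam=1):
    # L(w) = n^{-1} sum_i [ d(x_i, x_{h_i^kappa} | w) - d(x_i, x_{m_i^lam} | w) ];
    # minimizing this maximizes the average (kappa, lam)-neighbor margin.
    n = len(y)
    total = 0.0
    for i in range(n):
        h, m = target_neighbors(X, y, i, w, kappa, lam)
        total += weighted_sq_dist(X[i], X[h], w) - weighted_sq_dist(X[i], X[m], w)
    return total / n
```

On well-separated data the loss is negative (hits closer than misses). Note that `target_neighbors` must be re-evaluated whenever `w` changes; this is precisely the TN-change issue the paper addresses.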
This loss function allows us to find feature weights such that the distance to the κ-th target hit is as small as possible, while the distance to the λ-th target miss is as large as possible.
A regularization term is introduced for incorporating our prior knowledge on the weight vector. Let w̄ ∈ [0, 1]^ℓ be our prior weight vector; we use the regularization term of the form Ω(w) := (1/2)||w − w̄||_2^2. For example, if we choose w̄ := ℓ^{−1} 1, it implies that our baseline choice of the feature weights is uniform, i.e., the Euclidean distance metric [6].
Given the loss term L(w) and the regularization term Ω(w), the feature weighting problem we are going to study in this paper is formulated as

min_w  θ n^{−1} Σ_{i∈N_n} ( d(x_i, x_{h_i^κ} | w) − d(x_i, x_{m_i^λ} | w) ) + (1/2)||w − w̄||_2^2
s.t.  1^⊤ w = 1,  w ≥ 0,   (2)

where θ ∈ R_+ is a regularization parameter for controlling the balance between the loss term L(w) and the regularization term Ω(w). The first equality constraint restricts the sum of the weights to be one, while the second constraint indicates that the weights are non-negative. The former is introduced for fixing the scale of the distance metric.
It is important to note that the TNs {(h_i^κ, m_i^λ)}_{i∈N_n} depend on the weights w, because the weighted Euclidean distance (1) is used in their definitions. Thus, we need to properly update the TNs in the optimization process. We refer to this problem as the target neighbor change (TN-change) problem. Since TNs change in a discrete fashion with respect to the weights w, the problem (2) has a non-smooth and non-convex objective function. In the next section, we introduce an algorithm for finding a local minimum solution of (2).
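With the TNs held fixed, the objective of (2) is a convex quadratic in w; with the TNs recomputed from w (as TN-weight consistency requires), it can be evaluated as in this sketch (toy data, function name, and parameter values are hypothetical):

```python
import numpy as np

def objective(X, y, w, wbar, theta, kappa=1, lam=1):
    # theta * n^{-1} sum_i [d(x_i, h_i^kappa | w) - d(x_i, m_i^lam | w)]
    #   + (1/2) * ||w - wbar||^2,
    # with the target neighbors recomputed under the current w.
    assert abs(w.sum() - 1.0) < 1e-9 and (w >= 0).all()  # feasibility: 1'w = 1, w >= 0
    n = len(y)
    loss = 0.0
    for i in range(n):
        sq = ((X - X[i]) ** 2) @ w              # weighted sq. distances to all instances
        hits = np.array([h for h in range(n) if h != i and y[h] == y[i]])
        misses = np.array([m for m in range(n) if y[m] != y[i]])
        h = hits[np.argsort(sq[hits])[kappa - 1]]
        m = misses[np.argsort(sq[misses])[lam - 1]]
        loss += sq[h] - sq[m]
    return theta * loss / n + 0.5 * float(np.sum((w - wbar) ** 2))

# Feature 1 separates the two classes, feature 2 is irrelevant; shifting weight
# toward feature 1 decreases the objective (larger average margin).
X = np.array([[0., 0.], [0., 1.], [1., 1.], [1., 0.]])
y = np.array([0, 0, 1, 1])
wbar = np.array([0.5, 0.5])
print(objective(X, y, np.array([0.9, 0.1]), wbar, theta=1.0)
      < objective(X, y, wbar, wbar, theta=1.0))  # -> True
```

The non-smoothness of (2) shows up here as the `argsort` step: the selected indices jump discretely as w varies, which is why a plain gradient method does not apply directly.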
An advantage of the proposed algorithm is that it monotonically decreases the objective function in (2), while the TNs are properly updated so that they are always kept consistent with the feature space metric given by the weights w in the following sense:

Definition 3 (TN-weight consistency) A weight vector w and n pairs of instances {(h_i^κ, m_i^λ)}_{i∈N_n} are said to be TN-weight consistent if {(h_i^κ, m_i^λ)}_{i∈N_n} are the TNs when the distance is measured by the weighted Euclidean distance (1) using the weights w.

^1 The terminologies target hit and miss were first used in [7], in which only the 1st target hit and miss were considered. We extend them to the κ-th target hit and λ-th target miss for general κ and λ. The terminology target neighbors (TNs) was first used in [5].
^2 The notion of the nearest neighbor margin was first introduced in [9], where only the case of κ = λ = 1 was considered. We use an extended definition with general κ and λ.

Figure 1 illustrates how TNs are defined. In the Euclidean feature space with w1 = w2 = 1/2, the 2nd target hit and miss of the instance ⓪ are given by (h_0^2, m_0^2) = (②, ⑥). Since d(x_0, x_2 | w) > d(x_0, x_6 | w), the instance ⓪ is misclassified in 3NN classification. On the other hand, in the weighted feature space with (w1, w2) = (2/3, 1/3), the 2nd target hit and miss of the instance ⓪ are given by (h_0^2, m_0^2) = (①, ⑤).
Since d(x_0, x_1 | w) < d(x_0, x_5 | w) under this weighted metric, the instance ⓪ is correctly classified in 3NN classification.

3 Algorithm

The problem (2) can be formulated as a convex quadratic program (QP) if the TNs are regarded as fixed. Based on this fact, our feature weighting algorithm solves a sequence of such QPs, while the TNs are properly updated so that they always remain consistent.

3.1 Active Set QP Formulation

First, we study the problem (2) under the condition that the TNs remain unchanged. Let us define the following sets of indices:

Definition 4 Given a weight vector w and the consistent TNs {(h_i^κ, m_i^λ)}_{i∈N_n}, define the following sets of index pairs for '∗' being '<', '=', and '>':

H[∗] := {(i, h) ∈ N_n × H_i | d(x_i, x_h | w) ∗ d(x_i, x_{h_i^κ} | w)},
M[∗] := {(i, m) ∈ N_n × M_i | d(x_i, x_m | w) ∗ d(x_i, x_{m_i^λ} | w)}.

They are collectively denoted by (H, M), where H := {H[<], H[=], H[>]} and M := {M[<], M[=], M[>]}.
Furthermore, for each i ∈ N_n, we define H_i^[∗] := {h | (i, h) ∈ H[∗]} and M_i^[∗] := {m | (i, m) ∈ M[∗]}.

Under the condition that {(h_i^κ, m_i^λ)}_{i∈N_n} remain TN-weight consistent, the problem (2) is written as

min_{w∈R^ℓ, ξ∈R^n, η∈R^n}  θ n^{−1} Σ_{i∈N_n} (ξ_i − η_i) + (1/2)||w − w̄||_2^2   (3a)
s.t.  1^⊤ w = 1,  w ≥ 0,   (3b)
d(x_i, x_h | w) ≤ ξ_i, (i, h) ∈ H[<];  d(x_i, x_m | w) ≤ η_i, (i, m) ∈ M[<],   (3c)
d(x_i, x_h | w) = ξ_i, (i, h) ∈ H[=];  d(x_i, x_m | w) = η_i, (i, m) ∈ M[=],   (3d)
d(x_i, x_h | w) ≥ ξ_i, (i, h) ∈ H[>];  d(x_i, x_m | w) ≥ η_i, (i, m) ∈ M[>].   (3e)

In the above, we introduced slack variables ξ_i and η_i for i ∈ N_n which represent the weighted distances to the target hit and miss, respectively. In (3), the TN-weight consistency is represented by a set of linear constraints (3c)–(3e)^3.
Our algorithm handles TN change as a change in the index sets (H, M), and a sequence of convex QPs in the form of (3) are (partially) solved every time the index sets (H, M) are updated. We implement this approach by using an active set QP algorithm (see Chapter 16 in [15]). Briefly, the active set QP algorithm repeats the following two steps: (step 1) estimate the optimal active set^4, and (step 2) solve an equality-constrained QP by regarding the constraints in the current active set as equality constraints, while all the other non-active constraints are temporarily disregarded. An advantage of introducing the active set QP algorithm is that TN change can be naturally handled as active set change.
Specifically, a change of target hits is interpreted as an exchange of members between H[<] and H[=] or between H[>] and H[=], while a change of target misses is interpreted as an exchange of members between M[<] and M[=] or between M[>] and M[=].

^3 Note that the constraints for (H[<], H[=], H[>]) in (3c)–(3e) restrict that h must remain the target hit of i for all (i, h) ∈ H[=], because those closer than the target hit must remain closer and those more distant than the target hit must remain more distant. Similarly, the constraints for (M[<], M[=], M[>]) in (3c)–(3e) restrict that m must remain the target miss of i for all (i, m) ∈ M[=].
^4 A constraint satisfied with equality is called active, and the set of active constraints is called the active set.

3.2 Sequential QP-based Feature Weighting Algorithm

Here, we present our feature weighting algorithm. We first formulate the equality-constrained QP (EQP) of (3). Then we present how to update the EQP by changing the active sets.
In order to formulate the EQP of (3), we introduce another pair of index sets Z := {j | w_j = 0} and P := {j | w_j > 0}. Suppose that we currently have a solution (w, ξ, η) and the active set (H[=], M[=], Z). We first check whether the solution minimizes the objective function (3a) in the subspace defined by the active set.
If not, we compute a step (Δw, Δξ, Δη) by solving an EQP:

min_{Δw, Δξ, Δη}  θ n^{−1} Σ_{i∈N_n} ((ξ_i + Δξ_i) − (η_i + Δη_i)) + (1/2)||(w + Δw) − w̄||_2^2   (4)
s.t.  1^⊤(w + Δw) = 1;  w_j + Δw_j = 0, j ∈ Z;
ε_{i,h}^⊤(w + Δw) = ξ_i + Δξ_i, (i, h) ∈ H[=];  ε_{i,m}^⊤(w + Δw) = η_i + Δη_i, (i, m) ∈ M[=].

The solution of the EQP (4) can be obtained analytically by solving a small linear system (see Supplement A).
Next, we decide how far we can move the solution along this direction. We set w ← w + τΔw, ξ ← ξ + τΔξ, η ← η + τΔη, where τ ∈ [0, 1] is the step length determined by the following lemma.

Lemma 5 The maximum step length that satisfies feasibility and TN-weight consistency is given by

τ := min( 1,
  min_{j∈P, Δw_j<0}  −w_j / Δw_j,
  min_{(i,h)∈H[<], ε_{i,h}^⊤Δw>Δξ_i}  −(ε_{i,h}^⊤ w − ξ_i) / (ε_{i,h}^⊤ Δw − Δξ_i),
  min_{(i,h)∈H[>], ε_{i,h}^⊤Δw<Δξ_i}  −(ε_{i,h}^⊤ w − ξ_i) / (ε_{i,h}^⊤ Δw − Δξ_i),
  min_{(i,m)∈M[<], ε_{i,m}^⊤Δw>Δη_i}  −(ε_{i,m}^⊤ w − η_i) / (ε_{i,m}^⊤ Δw − Δη_i),
  min_{(i,m)∈M[>], ε_{i,m}^⊤Δw<Δη_i}  −(ε_{i,m}^⊤ w − η_i) / (ε_{i,m}^⊤ Δw − Δη_i) ).   (5)

The proof of the lemma is presented in Supplement B.
If τ < 1, the constraint for which the minimum in (5) is achieved (called the blocking constraint) is added to the active set. For example, if (i, h) ∈ H[>] achieved the minimum in (5), (i, h) is moved from H[>] to H[=].
We repeat this process, adding constraints to the active set, until we reach the solution (w, ξ, η) that minimizes the objective function over the current active set.
Next, we need to consider whether the objective function of (2) can be further decreased by removing constraints from the active set. Our algorithm and the standard active set QP algorithm differ in this operation: in our algorithm, an active constraint is allowed to become inactive only when the κ-th target hit remains a member of H[=] and the λ-th target miss remains a member of M[=]. Let us introduce the Lagrange multipliers α ∈ R^{|Z|}, β ∈ R^{|H[=]|}, and γ ∈ R^{|M[=]|} for the 2nd, the 3rd, and the 4th constraints in (4), respectively (see Supplement A for details). Then the following lemma tells us which active constraint should be removed.

Lemma 6 The objective function in (2) can be further decreased while satisfying feasibility and TN-weight consistency by removing one of the constraints in the active set with the following rules^5:
• If α_j > 0 for j ∈ Z, then move {j} to P;
• If β_{(i,h)} < 0, |H_i^[<]| ≤ κ − 2, and |H_i^[=]| ≥ 2 for (i, h) ∈ H[=], then move (i, h) to H[<];
• If β_{(i,h)} > 0, |H_i^[>]| < |H_i| − κ, and |H_i^[=]| ≥ 2 for (i, h) ∈ H[=], then move (i, h) to H[>];
• If γ_{(i,m)} < 0, |M_i^[<]| ≤ λ − 2, and |M_i^[=]| ≥ 2 for (i, m) ∈ M[=], then move (i, m) to M[<];
• If γ_{(i,m)} > 0, |M_i^[>]| < |M_i| − λ, and |M_i^[=]| ≥ 2 for (i, m) ∈ M[=], then move (i, m) to M[>].

^5 If multiple active constraints are selected by these rules, the one with the largest absolute Lagrange multiplier is removed from the active set.

The proof of the lemma is presented in Supplement C.
The proposed feature weighting algorithm, which we
call the Sequential QP-based Feature Weighting (SQP-FW) algorithm, is summarized in Algorithm 1.

Algorithm 1 Sequential QP-based Feature Weighting (SQP-FW) Algorithm
Inputs: the training instances {(x_i, y_i)}_{i∈N_n}, the neighborhood parameters (κ, λ), the regularization parameter θ, and the initial weight vector w̄;
Initialize w ← w̄, (ξ, η), and (H, M, Z, P);
for t = 1, 2, ... do
  Solve (4) to find (Δw, Δξ, Δη);
  if (Δw, Δξ, Δη) = 0 then
    Compute the Lagrange multipliers α, β, and γ;
    if none of the active constraints satisfies the rules in Lemma 6 then
      stop with solution w* = w;
    else
      Update (H, M, Z, P) according to the rules in Lemma 6;
  else
    Compute the step size τ as in Lemma 5 and set w ← w + τΔw, ξ ← ξ + τΔξ, η ← η + τΔη;
    if there are blocking constraints then
      Update (H, M, Z, P) by adding one of the blocking constraints in Lemma 5;
Outputs: a local optimal vector of feature weights w*.

The proposed SQP-FW algorithm possesses the following useful properties.
Optimality conditions: We can characterize a local optimal solution of the non-smooth and non-convex problem (2) in the following theorem (its proof is presented in Supplement D):

Theorem 7 (Optimality condition) Consider a weight vector w satisfying 1^⊤ w = 1 and w ≥ 0, the consistent TNs {(h_i^κ, m_i^λ)}_{i∈N_n}, and the index sets (H, M, Z, P). Then, w is a local minimum solution of the problem (2) if and only if the EQP (4) has the solution (Δw, Δξ, Δη) = 0 and there are no active constraints that satisfy the rules in Lemma 6.

This theorem is practically useful because it guarantees that the solution cannot be improved in its neighborhood even if some of the current TNs are replaced with others.
Without such an optimality condition, we would have to check all possible combinations of TN changes from the current solution in a trial-and-error manner. The above theorem allows us to avoid such a time-consuming procedure.
Finite termination property: It can be shown that the SQP-FW algorithm converges to a local minimum solution characterized by Theorem 7 in a finite number of iterations, based on an argument similar to that in pages 477-478 of [15]. See Supplement E for details.
Computational complexity: When computing the solutions (Δw, Δξ, Δη) and the Lagrange multipliers (α, β, γ) by solving the EQP (4), the main computational cost is only several matrix-vector multiplications involving n × |P| and n × |Z| matrices, which is linear with respect to n (see Supplement A for details). On the other hand, if the minimum step length τ is computed naively by Lemma 5, it takes O(n^2 |P|) computations, which could be a bottleneck of the algorithm. However, this bottleneck can be eased by introducing a working set approach: only a fixed number of constraints in the working set are evaluated at each step, while the working set is updated, say, every 100 steps. In our implementation, we introduced such working sets for H[>] and M[>]. For each i ∈ N_n, these working sets contain, say, only the top 100 nearest instances. This strategy is based on the natural idea that instances outside of the top 100 nearest would not become TNs within the next 100 steps.
Such a working set strategy allows us to reduce the computational complexity of computing the minimum step length τ to O(n|P|), which is linear with respect to n.
Regularization path tracking: The SQP-FW algorithm can be naturally combined with a regularization path tracking algorithm for computing a path of the solutions that satisfy the optimality condition in Theorem 7 over a range of the regularization parameter θ. Due to the space limitation, we only describe the outline here (see Supplement F for details). The algorithm starts from a local optimal solution for a fixed regularization parameter θ. Then, the algorithm continues finding the optimal solutions as θ is slightly increased. It can be shown that the local optimal solution of (2) is a piecewise-linear function of θ as long as the TNs remain unchanged. If θ is further increased, we encounter a point at which the TNs must be updated. Such TN changes can be easily detected and handled because the TN-weight consistency conditions are represented by a set of linear constraints (see (3c)–(3e)), and we already have explicit rules (Lemmas 5 and 6) for updating the constraints. The regularization path tracking algorithm provides an efficient and insightful approach to model selection.

4 Experiments

In this section, we investigate the experimental performance of the proposed algorithm^6.

4.1 Comparison Using UCI Data Sets

First, we compare the proposed SQP-FW algorithm with existing feature weighting algorithms, which handle the TN-change problem in different ways.
• Relief [7, 8]: The Relief algorithm is an online feature weighting algorithm. The goal of Relief is to maximize the average (1, 1)-neighbor margin over instances.
The TNs {(h_i^1, m_i^1)}_{i∈N_n} are determined by the initial Euclidean metric and fixed all through the training process.
• Simba [9]: Simba is also an online algorithm aiming to maximize the average (1, 1)-neighbor margin. The key difference from Relief is that the TNs {(h_i^1, m_i^1)}_{i∈N_n} are updated in each step using the current feature-space metric. The TN-change problem is alleviated in Simba by this reassignment.
• MulRel: To mitigate the TN-weight inconsistency in Relief, we repeat the Relief procedure using the TNs defined by the weights learned in the previous loop (see also [5]).
• NCA-D [4]: Neighborhood component analysis with a diagonal metric, which is essentially the same as I-Relief [10, 11]. Instead of discretely assigning TNs, the probability of an instance being a TN is considered. Using these stochastic neighbors, the average margin is formulated as a continuous (non-convex) function of the weights, by which the TN-change problem is mitigated.
We compared the NN classification performance of these 4 algorithms and the SQP-FW algorithm on the 10 UCI benchmark data sets summarized in Table 1. In each data set, we randomly divided the entire data set into training, validation, and test sets of equal sizes. The number of neighbors k ∈ {1, 3, 5} was selected based on the classification performance on the validation set.
In the SQP-FW algorithm, the neighborhood parameters (κ, λ) and the regularization parameter θ were also determined to maximize the classification accuracy on the validation set. The neighborhood parameters (κ, λ) were chosen from {(1, 1), (2, 2), (3, 3)}, while θ was chosen from 100 evenly allocated candidates in log-scale between 10^{−3} and 10^0. The working set strategy was used when n > 1000, with working set size 100 and working set update frequency 100.
None of the 4 existing algorithms has explicit hyper-parameters.
However, since these algorithms also carry a risk of overfitting, we removed features with small weights, following the recommendation in [7, 11]. We implemented this heuristic for all the 4 existing algorithms by optimizing the percentage of eliminated features (chosen from {0%, 1%, 2%, ..., 99%}) based on the classification performance on the validation set. Since Simba and NCA are formulated as non-convex optimization problems and their solutions may be trapped in local minima, we ran these two algorithms from five randomly selected starting points and adopted the solution with the smallest training error. The number of iterations in Relief (and in the inner loop of MulRel) and in Simba was set to 1000, and the number of outer-loop iterations of MulRel was set to 100.
The experiments were repeated 10 times with random data splitting, and the average performance was reported. To assess the statistical significance of the differences, a paired-sample t-test was conducted. All the features were standardized to have zero mean and unit variance. Table 1 summarizes the results, showing that the SQP-FW algorithm compares favorably with the other methods.⁶

⁶See also Supplement G for an illustration of the behavior of the proposed algorithm using an artificial dataset.

Table 1: Average misclassification rate of the kNN classifier on 10 UCI benchmark data sets. 'S.S.' and 'N.C.' stand for sample size and number of classes, respectively. An asterisk '*' indicates the best among the 5 algorithms, while boldface means no statistical difference from the best (p-value ≥ 0.05).

Abbreviated Data Name    S.S.    ℓ   N.C.   Relief    Simba   MulRel   NCA-D   SQP-FW
Bre. Can. Dia.            569   30     2    0.047    0.046    0.056    0.058   *0.040
Con. Ben.                 208   60     2    0.227    0.230    0.294    0.276   *0.221
Ima. Seg.                2310   18     7   *0.049    0.061    0.065    0.049    0.052
Ionosphere                351   33     2    0.162    0.115    0.138   *0.097    0.122
Pag. Blo. Cla.           5473   10     5    0.048   *0.044    0.053    0.044    0.046
Parkinson                 195   22     2    0.117    0.123    0.109    0.128   *0.102
Pen. Rec. Han. Dig.     10992   16    10    0.012    0.012    0.020    0.029   *0.011
Spambase                 4601   57     2    0.108    0.110    0.117    0.112   *0.104
Wav. Dat. Gen. ver1      5000   21     3    0.202    0.217    0.227    0.195   *0.184
Win. Qua.                6497   11     7    0.499    0.471    0.494    0.495   *0.463

Table 2: Results on microarray data experiments. 'Error' represents the misclassification error rate of the 1NN classifier, while 'Med. #(genes)' indicates the median number of genes selected by the SQP-FW algorithm over 10 runs.

                                              Standard 1NN    Weighted 1NN with SQP-FW
Microarray Data Name     S.S.      ℓ   N.C.      Error           Error     Med. #(genes)
Colon Cancer [16]          62   2000     2    0.180 ± 0.059   0.140 ± 0.065      20
Kidney Cancer [17]         74   4224     3    0.075 ± 0.043   0.050 ± 0.038      10
Leukemia [18]              72   7129     2    0.108 ± 0.022   0.088 ± 0.036      14
Prostate Cancer [19]      102  12600     2    0.230 ± 0.048   0.194 ± 0.052      24

4.2 Application to Feature Selection Problem in High-Dimensional Microarray Data

In order to illustrate feature selection performance, we applied the SQP-FW algorithm to microarray studies, in which simple classification algorithms are often preferred because the number of features (genes) ℓ is usually much larger than the number of instances (patients) n.
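The two ingredients of these comparisons that are independent of the learning algorithm can be sketched in code: NN classification under a diagonal (feature-weighted) metric, and the small-weight elimination heuristic applied to the baselines. This is an illustrative simplification; `weighted_knn_predict` and `eliminate_small_weights` are hypothetical helper names, not from the paper.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, X_test, w, k=1):
    """kNN prediction under a diagonal metric given by nonnegative weights w.

    Zero-weight features drop out of the distance entirely, which is how
    feature weighting induces feature selection for NN classification."""
    # squared weighted Euclidean distance between every test/train pair
    diff = X_test[:, None, :] - X_train[None, :, :]
    dist = (diff ** 2 * w).sum(axis=2)
    # majority vote among the k nearest training instances
    nn = np.argsort(dist, axis=1)[:, :k]
    votes = y_train[nn]
    return np.array([np.bincount(row).argmax() for row in votes])

def eliminate_small_weights(w, pct):
    """Small-weight elimination heuristic: zero out roughly the smallest
    pct% of the weights (pct chosen on a validation set)."""
    thresh = np.percentile(w, pct)
    w = w.copy()
    w[w < thresh] = 0.0
    return w
```

For instance, a feature carrying large-scale noise can dominate the unweighted distance; setting its weight to zero restores the signal feature's influence on the nearest neighbor.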
Since biologists are interested in identifying a set of genes that governs the difference among different biological phenotypes (such as cancer subtypes), selecting a subset of genes that yields good NN classification performance would be practically valuable.
For each of the four microarray data sets in Table 2, we divided the entire set into training and test sets with size ratio 2:1 [2]. We compared the test set classification performance of the plain 1NN classifier (without feature weighting) and the weighted 1NN classifier with the weights determined by the SQP-FW algorithm. In the latter, the neighborhood parameters were fixed to κ = λ = 1 and θ was determined by 10-fold cross-validation within the training set. We repeated the data splitting 10 times and reported the average performance.
Table 2 summarizes the results, together with the median numbers of genes (features with nonzero weights) selected by the SQP-FW algorithm. Although the improvements in classification performance were not statistically significant (much improvement by feature weighting could not be expected because the misclassification rates of the plain 1NN classifier are already very low), the number of genes used for NN classification was greatly reduced. The results illustrate the potential advantage of feature selection using the SQP-FW algorithm.

5 Discussion and Conclusion

TN change is a fundamental problem in feature extraction and selection for NN classifiers. Our contribution in this paper was to present a feature weighting algorithm that can systematically handle TN changes and guarantee the TN-weight consistency.
An important future direction is to generalize our TN-weight consistent feature weighting scheme to feature extraction (i.e., metric learning).

Acknowledgment

IT was supported by MEXT KAKENHI 21200001 and 23700165, and MS was supported by MEXT KAKENHI 23120004.

References

[1] A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, pages 271–280. ACM, 2007.
[2] S. Dudoit, J. Fridlyand, and T. P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77–87, 2002.
[3] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 505–512. MIT Press, Cambridge, MA, 2003.
[4] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520. MIT Press, Cambridge, MA, 2005.
[5] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1473–1480. MIT Press, Cambridge, MA, 2006.
[6] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216, 2007.
[7] K. Kira and L. Rendell. A practical approach to feature selection. In Proceedings of the 9th International Conference on Machine Learning, pages 249–256, 1992.
[8] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, pages 171–182, 1994.
[9] R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin based feature selection - theory and algorithms. In Proceedings of the 21st International Conference on Machine Learning, pages 43–50, 2004.
[10] Y. Sun and J. Li. Iterative RELIEF for feature weighting. In Proceedings of the 23rd International Conference on Machine Learning, pages 913–920, 2006.
[11] Y. Sun, S. Todorovic, and S. Goodison. Local learning based feature selection for high dimensional data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1610–1626, 2010.
[12] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning, pages 577–584, 2001.
[13] M. Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, 2007.
[14] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.
[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.
[16] U. Alon, N. Barkai, D. A. Notterman, K. Gish, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96:6745–6750, 1999.
[17] H. Sueltmann, A. Heydenbreck, W. Huber, R. Kuner, et al. Gene expression in kidney cancer is associated with novel tumor subtypes, cytogenetic abnormalities and metastasis formation.
[18] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.
[19] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203–209, 2002.