{"title": "Generative Local Metric Learning for Kernel Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2452, "page_last": 2462, "abstract": "This paper shows how metric learning can be used with Nadaraya-Watson (NW) kernel regression. Compared with standard approaches, such as bandwidth selection, we show how metric learning can significantly reduce the mean square error (MSE) in kernel regression, particularly for high-dimensional data. We propose a method for efficiently learning a good metric function based upon analyzing the performance of the NW estimator for Gaussian-distributed data. A key feature of our approach is that the NW estimator with a learned metric uses information from both the global and local structure of the training data. Theoretical and empirical results confirm that the learned metric can considerably reduce the bias and MSE for kernel regression even when the data are not confined to Gaussian.", "full_text": "Generative Local Metric Learning for\n\nKernel Regression\n\nYung-Kyun Noh\n\nMasashi Sugiyama\n\nSeoul National University, Rep. of Korea\n\nRIKEN / The University of Tokyo, Japan\n\nnohyung@snu.ac.kr\n\nsugi@k.u-tokyo.ac.jp\n\nKee-Eung Kim\n\nKAIST, Rep. of Korea\n\nkekim@cs.kaist.ac.kr\n\nSeoul National University, Rep. of Korea\n\nFrank C. Park\n\nfcp@snu.ac.kr\n\nDaniel D. Lee\n\nUniversity of Pennsylvania, USA\n\nddlee@seas.upenn.edu\n\nAbstract\n\nThis paper shows how metric learning can be used with Nadaraya-Watson (NW)\nkernel regression. Compared with standard approaches, such as bandwidth selec-\ntion, we show how metric learning can signi\ufb01cantly reduce the mean square error\n(MSE) in kernel regression, particularly for high-dimensional data. We propose a\nmethod for ef\ufb01ciently learning a good metric function based upon analyzing the\nperformance of the NW estimator for Gaussian-distributed data. 
A key feature of\nour approach is that the NW estimator with a learned metric uses information from\nboth the global and local structure of the training data. Theoretical and empirical\nresults con\ufb01rm that the learned metric can considerably reduce the bias and MSE\nfor kernel regression even when the data are not con\ufb01ned to Gaussian.\n\n1\n\nIntroduction\n\n(cid:80)N\n(cid:80)N\n\n(cid:98)y(x) =\n\nThe Nadaraya-Watson (NW) estimator has long been widely used for nonparametric regression\n[16, 26]. The NW estimator uses paired samples to compute a locally weighted average via a kernel\nfunction, K(\u00b7,\u00b7): RD \u00d7 RD \u2192 R, where D is the dimensionality of data samples. The resulting\nestimated output for an input x \u2208 RD is given by the equation:\n\ni=1 K(xi, x)yi\ni=1 K(xi, x)\n\n(1)\nfor data D = {xi, yi}N\ni=1 with xi \u2208 RD and yi \u2208 R, and a translation-invariant kernel\nK(xi, x) = K((x \u2212 xi)2). This estimator is regarded as a fundamental canonical method in\nsupervised learning for modeling non-linear relationships using local information. It has previously\nbeen used to interpret predictions using kernel density estimation [11], memory retrieval, decision\nmaking models [19], minimum empirical mean square error (MSE) with local weights [10, 23], and\nsampling-based Bayesian inference [25]. All of these interpretations utilize the fact that the estimator\nwill asymptotically converge to the optimal Ep(y|x)[y] with minimum MSE given an in\ufb01nite number\nof data samples.\n\nHowever, with \ufb01nite samples, the NW output(cid:98)y(x) is no longer optimal and can deviate signi\ufb01cantly\n\nfrom the true conditional expectation. In particular, the weights given along the directions of large\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Metric dependency of kernels. 
The level curves of the kernel are hyper-spheres for the isotropic kernels in (a), while they are hyper-ellipsoids for kernels with the Mahalanobis metric, as shown in (b). The principal directions of the hyper-ellipsoids are the eigenvectors of the symmetric positive definite matrix A used in the Mahalanobis distance. When the target variable y varies along the ∇y direction in the figure, the weighted average incurs less bias if the metric is extended along the direction orthogonal to ∇y, as shown in (b).

variability in y (e.g., the direction of ∇y in Fig. 1(a)) causes significant deviation. In this case, naively changing the kernel shape, as shown in Fig. 1(b), can alleviate the deviation. In this work, we investigate more sophisticated methods for finding appropriate kernel shapes via metric learning.

Metric learning is used to find specific directions with increased variability. Using information from the training examples, metric learning shrinks or extends distances in directions that are more or less important. A number of studies have focused on using metric learning for nearest neighbor classification [3, 6, 8, 17, 27], and many recent works have applied it to kernel methods as well [12, 13, 28]. Most of these approaches focus on modifying relative distances using triplet relationships or on minimizing empirical error with some regularization.

In conventional NW regression, the deviation due to finite sampling is mitigated by controlling the bandwidth of the kernel function. The bandwidth controls the balance between the bias and the variance of the estimator, and the finite-sample deviation is reduced with an appropriate selection of the bandwidth [9, 20, 21]. Other approaches include explicitly subtracting an estimated bias [5, 24] or using a higher-order kernel that eliminates the leading-order terms of the bias [22].
However, many of these direct approaches behave poorly in high-dimensional spaces for two reasons: distance information is dominated by noise, and because they rely only on nearby data, local algorithms suffer from the small number of samples they effectively use.

In this work, we apply a metric learning method to mitigate the bias. Unlike conventional metric learning methods, we analyze the effect of the metric on the asymptotic bias and variance of the NW estimator, and we then apply a generative model to alleviate the bias in high-dimensional spaces. Our theoretical analysis shows that, under a jointly Gaussian assumption on x and y, the metric learning problem reduces to a simple eigenvector problem of finding a two-dimensional embedding space in which the noise is effectively removed. Our approach is similar to a previous method that applies a simple generative model to mitigate bias [18], but our analysis shows that there always exists a metric that eliminates the leading-order bias for Gaussians of any shape, and that two dimensions suffice to achieve zero leading-order bias. The algorithm based on this analysis performs well on many benchmark datasets. We interpret this result to mean that the NW estimator indirectly uses global information through the rough generative model; the results improve because information from the global covariance structure, which would otherwise never be used in NW estimation, is exploited in addition to the local information.

One well-known extension of NW regression for reducing its bias is locally linear regression (LLR) [23]. LLR also achieves zero leading-order bias for Gaussian data, but its parameters are estimated purely locally, which is prone to overfitting in high-dimensional problems. In our experiments, we compare our method with LLR and demonstrate that it compares favorably with LLR and other competitive methods.

The rest of the paper is organized as follows.
In Section 2, we explain our metric learning formulation for kernel regression. In Section 3, we derive the bias and its relationship to the metric, and our proposed algorithm is introduced in Section 4. In Section 5, we provide experiments against other standard regression methods, and we conclude with a discussion in Section 6.

2 Metric Learning in Kernel Methods

We consider a Mahalanobis-type distance for metric learning. The Mahalanobis-type distance between two data points x_i ∈ R^D and x_j ∈ R^D is defined in this work as

||x_i − x_j||_A = \sqrt{(x_i − x_j)^T A (x_i − x_j)},    A ≻ 0,  A^T = A,  |A| = 1,    (2)

with a symmetric positive definite matrix A ∈ R^{D×D} and |A| the determinant of A. Using this metric, we consider a metric space in which distances are extended or shrunk along the directions of the eigenvectors of A, while the volume of the hypersphere is kept the same due to the determinant constraint. With the identity matrix A = I, we recover the conventional Euclidean distance.

A kernel function capturing local information typically decays rapidly outside a certain distance; conventionally, a bandwidth parameter h is used to control the effective number of data within the range of interest. Taking the Gaussian kernel as an example, with the aforementioned metric and bandwidth, the kernel function can be written as

K(x_i, x) = K\Big( \frac{||x_i − x||_A}{h} \Big) = \frac{1}{(\sqrt{2\pi})^D h^D} \exp\Big( −\frac{1}{2h^2} (x_i − x)^T A (x_i − x) \Big),    (3)

where the "relative" bandwidths along individual directions are determined by A, and the overall size of the kernel is determined by h.

3 Bias of Nadaraya-Watson Kernel Estimator

We first note that our target function is the conditional expectation y(x) = E[y|x], and it is invariant to metric change.
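Before turning to the bias analysis, the estimator of Eq. (1) with the metric kernel of Eq. (3) can be sketched in a few lines of NumPy (a minimal illustration with toy data and parameters of our choosing, not the paper's implementation):

```python
import numpy as np

def mahalanobis_kernel(X, x, A, h):
    """Gaussian kernel of Eq. (3): Mahalanobis metric A (with |A| = 1), bandwidth h."""
    diff = X - x                                    # (N, D) differences x_i - x
    quad = np.einsum('ni,ij,nj->n', diff, A, diff)  # (x_i - x)^T A (x_i - x)
    D = X.shape[1]
    return np.exp(-quad / (2.0 * h**2)) / ((2.0 * np.pi) ** (D / 2) * h**D)

def nw_estimate(X, y, x, A=None, h=0.5):
    """Nadaraya-Watson estimate of Eq. (1) at a query point x."""
    if A is None:
        A = np.eye(X.shape[1])                      # Euclidean metric
    w = mahalanobis_kernel(X, x, A, h)
    return float(w @ y / w.sum())

# Toy data: y depends (almost) only on the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 2))
y = X[:, 0] + 0.05 * rng.normal(size=3000)
```

Near the data mean the estimate is close to E[y|x]; away from the mean, the finite-sample deviation analyzed below pulls the estimate toward the overall average.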
When we consider a D-dimensional vector x ∈ R^D and its stochastic prediction y ∈ R, the conditional expectation y(x) = E[y|x] minimizes the MSE. If we consider two spaces with coordinates x ∈ R^D and z ∈ R^D and a linear transformation z = L^T x between them, with a full-rank square matrix L ∈ R^{D×D}, the expectation of y is invariant to the coordinate change, satisfying E[y|x] = E[y|z], because the conditional density is preserved by the metric change: p(y|x) = p(y|z) for all corresponding x and z, and

E[y|x] = \int y\, p(y|x)\, dy = \int y\, p(y|z)\, dy = E[y|z].    (4)

This equivalence means that the target function is invariant to a metric change with A = LL^T, and considering that the NW estimator achieves the optimal prediction E[y|x] with infinite data, the optimal prediction is achieved with infinite data regardless of the choice of metric. The metric dependency is therefore a finite-sampling effect, entering through the bias and the variance.

3.1 Metric Effects on Bias

The bias is the expected deviation of the estimator from the true mean of the target variable y(x):

Bias = E[\hat{y}(x) − y(x)] = E\Big[ \frac{\sum_{i=1}^N K(x_i, x)\, y_i}{\sum_{i=1}^N K(x_i, x)} − y(x) \Big].    (5)

Standard methods for calculating the bias assume asymptotic concentration around the means of both the numerator and the denominator of the NW estimator. Usually, the numerator and denominator are approximated separately, and the bias of the whole NW estimator is calculated using a plug-in method [15, 23]. We assume a kernel satisfying \int K(z)\, dz = 1, \int z K(z)\, dz = 0, and \int z z^T K(z)\, dz = I. For example, the Gaussian kernel in Eq.
(3) satisfies all of these conditions. We can then first approximate the denominator as¹

E_{x_1,...,x_N}\Big[ \frac{1}{N} \sum_{i=1}^N K(x_i, x) \Big] = p(x) + \frac{h^2}{2} \nabla^2 p(x) + O(h^4),    (6)

with the Laplacian ∇², the trace of the Hessian with respect to x. Similarly, the expectation of the numerator becomes

E_{x_1,...,x_N,\, y_1,...,y_N}\Big[ \frac{1}{N} \sum_{i=1}^N K(x, x_i)\, y_i \Big] = p(x)\, y(x) + \frac{h^2}{2} \nabla^2 [p(x)\, y(x)] + O(h^4).    (7)

Using the plug-ins of Eq. (6) and Eq. (7), we can find the leading-order terms of the NW estimate, and the bias of the NW estimator is obtained as

E\Big[ \frac{\sum_{i=1}^N K(x, x_i)\, y_i}{\sum_{i=1}^N K(x, x_i)} \Big] − y(x) = h^2 \Big( \frac{\nabla^T p(x)\, \nabla y(x)}{p(x)} + \frac{\nabla^2 y(x)}{2} \Big) + O(h^4).    (8)

Here, all gradients ∇ and Laplacians ∇² are with respect to x. We have noted that the target y(x) = E[y|x] is invariant to the metric change; the metric dependency enters through the finite-sample deviation terms, since both the gradient and the Laplacian in the deviation depend on the metric A.

¹See the Appendix in the supplementary material for the detailed derivation.

3.2 Conventional Methods of Reducing Bias

Several previous works aim to reduce this deviation [9, 20, 21]. A standard approach is to adapt the bandwidth parameter h under the minimum-MSE criterion. Bandwidth selection has the intuitive motivation of balancing the tradeoff between bias and variance: the bias can be reduced by using a small bandwidth, but at the cost of increasing the variance.
Therefore, for bandwidth selection, the bias and variance criteria have to be used at the same time.

Another straightforward and well-known extension of the NW estimator is locally linear regression (LLR) [2, 23]. Noting that Eq. (1) is the solution minimizing the local empirical MSE,

y(x) = \arg\min_{\alpha \in R} \sum_{i=1}^N (y_i − \alpha)^2 K(x_i, x),    (9)

LLR extends this objective function to

[y(x), \beta^*(x)] = \arg\min_{\alpha \in R,\, \beta \in R^D} \sum_{i=1}^N \big( y_i − \alpha − \beta^T (x_i − x) \big)^2 K(x_i, x),    (10)

in order to eliminate the noise produced by the linear component of the target function. The vector parameter β*(x) ∈ R^D is the local gradient estimated from local data, and this vector often overfits in high-dimensional spaces, resulting in a poor solution for α.

However, LLR asymptotically yields the bias

Bias_LLR = \frac{h^2}{2} \nabla^2 y(x) + O(h^4).    (11)

Eq. (11) can be compared with the NW bias in Eq.
(8): the bias term arising from the linear variation of y with respect to x, h^2\, \nabla^T p\, \nabla y / p, is eliminated.

4 Metric for Nadaraya-Watson Regression

In this section, we propose a metric that appropriately reduces the metric-dependent bias of the NW estimator.

4.1 Nadaraya-Watson Regression for Gaussian

In order to obtain a metric, we first provide the following theorem, which guarantees the existence of a good metric eliminating the leading-order bias at any point, regardless of the configuration of the Gaussian.

Theorem 1: At any point x, there exists a metric matrix A such that, for data x ∈ R^D and output y ∈ R jointly generated from any (D+1)-dimensional Gaussian, NW regression with the distance d(x, x') = ||x − x'||_A, for x, x' ∈ R^D, has zero leading-order bias.

Based on this theorem, we will use the corresponding metric space for NW regression at each point. The theorem is proven using the following Proposition 2 and Lemma 3, which are general claims that do not require the Gaussian assumption.

Proposition 2: There exists a symmetric positive definite matrix A that, when used with the metric in Eq. (2), eliminates the first term \nabla^T p(x)\, \nabla y(x) / p(x) inside the bias in Eq. (8), provided the gradients of p(x) and y(x) are linearly independent and p(x) is bounded away from zero.

Proof: We consider a coordinate transformation z = L^T x with L satisfying A = LL^T. The gradient of a differentiable function y(·) and a density function p(·)
with respect to z is

\nabla_z y(z)\big|_{z=L^T x} = L^{-1} \nabla_x y(x),    \nabla_z p(z)\big|_{z=L^T x} = \frac{1}{|L|} L^{-1} \nabla_x p(x),    (12)

and the scalar \nabla^T p(x)\, \nabla y(x) in the Euclidean space can be rewritten in the transformed space as

\nabla_z^T p(z)\, \nabla_z y(z) = \frac{1}{2} \big( \nabla_z^T p(z)\, \nabla_z y(z) + \nabla_z^T y(z)\, \nabla_z p(z) \big)    (13)
  = \frac{1}{2|L|} \big( \nabla_x^T p(x)\, L^{-T} L^{-1} \nabla_x y(x) + \nabla_x^T y(x)\, L^{-T} L^{-1} \nabla_x p(x) \big)    (14)
  = \frac{1}{2|A|^{1/2}}\, \mathrm{tr}\big[ A^{-1} \big( \nabla_x y(x)\, \nabla_x^T p(x) + \nabla_x p(x)\, \nabla_x^T y(x) \big) \big].    (15)

The symmetric matrix B = \nabla y(x)\, \nabla^T p(x) + \nabla p(x)\, \nabla^T y(x) has rank two when \nabla y(x) and \nabla p(x) are linearly independent, and can be eigen-decomposed as

B = [u_1\; u_2] \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} [u_1\; u_2]^T    (16)

with eigenvectors u_1 and u_2 and nonzero eigenvalues λ_1 and λ_2. A sufficient condition for the existence of A is that the two eigenvalues have different signs, in other words, λ_1 λ_2 < 0.

Let λ_1 > 0 and λ_2 < 0 without loss of generality, and choose a positive definite matrix having the following eigenvector decomposition:

A = [u_1\; u_2\; \cdots] \begin{pmatrix} \lambda_1 & 0 & \\ 0 & -\lambda_2 & \\ & & \ddots \end{pmatrix} [u_1\; u_2\; \cdots]^T.    (17)

Then Eq. (15) becomes zero, yielding a zero value for the first term of the bias with nonzero p(x). Therefore, we can always find an A that eliminates the first term of the bias once B has one positive and one negative eigenvalue, and the following Lemma 3 proves that B always has exactly one of each. □

Lemma 3: A symmetric matrix B = (B' + B'^T)/2 has two nonzero eigenvalues for a rank-one matrix B' = v_1 v_2^T with two linearly independent vectors v_1 and v_2. One of the two eigenvalues is positive, and the other is negative.

Proof: We can reformulate B as

B = \frac{1}{2} (v_1 v_2^T + v_2 v_1^T) = \frac{1}{2} [v_1\; v_2] \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} [v_1\; v_2]^T.    (18)

If we form the 2×2 matrix M = [v_1\; v_2]^T B\, [v_1\; v_2], its determinant can be written, using the eigen-decomposition of B with eigenvectors u_1, u_2 and eigenvalues λ_1, λ_2, as

|M| = \big| [v_1\; v_2]^T B\, [v_1\; v_2] \big|    (19)
  = \Big| [v_1\; v_2]^T [u_1\; u_2] \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} [u_1\; u_2]^T [v_1\; v_2] \Big|    (20)
  = \lambda_1 \lambda_2 \big( v_1^T u_1\, v_2^T u_2 − v_1^T u_2\, v_2^T u_1 \big)^2,    (21)

and at the same time |M| is always negative by the following derivation:

|M| = \big| [v_1\; v_2]^T B\, [v_1\; v_2] \big| = \frac{1}{4} \big| [v_1\; v_2]^T [v_1\; v_2] \big|^2 \Big| \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \Big| < 0.    (22)

From these calculations, λ_1 λ_2 < 0, so λ_1 and λ_2 always have different signs. □

With Proposition 2 and Lemma 3, we always have a metric space, associated with the A in Eq. (17), that eliminates the leading-order bias for a Gaussian, because ∇²y(x) = 0 holds whenever x and y are jointly Gaussian, eliminating the second term of Eq. (8) as well.

4.2 Gaussian Model for Metric Learning

We now know that there exists a scaling, induced by a metric change, under which NW regression achieves a bias of O(h^4). The metric we use is

A_{NW} = \beta\, [u_+\; u_-] \begin{pmatrix} \lambda_+ & 0 \\ 0 & -\lambda_- \end{pmatrix} [u_+\; u_-]^T + \gamma I,    |A_{NW}| = 1.    (23)

Here, β is the constant determined by the constraint |A_{NW}| = 1. We use one positive and one negative eigenvalue, λ_+ > 0 and λ_- < 0, of the matrix

B = \nabla y(x)\, \nabla^T p(x) + \nabla p(x)\, \nabla^T y(x),    (24)

and their corresponding eigenvectors u_+ and u_-. A small positive regularization constant γ, multiplied by the identity matrix, is added.

With the regularization term added to the metric, the deviation computed with the exact ∇p(x) and ∇y(x) is no longer exactly zero, but it remains small:

\frac{h^2}{2p(x)}\, \mathrm{tr}[A_{NW}^{-1} B] = \frac{h^2}{2p(x)} \Big( \frac{\lambda_+}{\beta\lambda_+ + \gamma} + \frac{\lambda_-}{-\beta\lambda_- + \gamma} \Big) = -\frac{\gamma h^2}{2p(x)\beta^2} \cdot \frac{\lambda_+ + \lambda_-}{\lambda_+ \lambda_-} + O(\gamma^2).
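The construction in Eqs. (23)-(24) is easy to verify numerically. The sketch below is an illustrative implementation (the determinant constraint is imposed by a simple rescaling rather than by solving for β explicitly): it builds A_NW from given gradients and checks that tr[A_NW^{-1} B] nearly vanishes for small γ.

```python
import numpy as np

def nw_metric(grad_p, grad_y, gamma=1e-3):
    """Metric of Eq. (23): keep the positive/negative eigen-pair of B (Eq. (24)),
    flip the sign of the negative eigenvalue, add gamma*I, rescale so |A| = 1."""
    D = grad_p.shape[0]
    B = np.outer(grad_y, grad_p) + np.outer(grad_p, grad_y)  # rank two, Eq. (24)
    lam, U = np.linalg.eigh(B)       # ascending: lam[0] < 0 < lam[-1] (Lemma 3)
    A = (lam[-1] * np.outer(U[:, -1], U[:, -1])
         - lam[0] * np.outer(U[:, 0], U[:, 0])
         + gamma * np.eye(D))
    return A / np.linalg.det(A) ** (1.0 / D)                 # enforce |A| = 1

rng = np.random.default_rng(1)
grad_p, grad_y = rng.normal(size=5), rng.normal(size=5)
A = nw_metric(grad_p, grad_y)
B = np.outer(grad_y, grad_p) + np.outer(grad_p, grad_y)
residual = np.trace(np.linalg.solve(A, B))   # ~0: leading bias term removed
```

For comparison, with the Euclidean metric the same quantity is tr[B] = 2 ∇y^T ∇p, which is generally far from zero.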
However, with small γ, the deviation remains low unless p(x) is close to zero or ∇p(x) and ∇y(x) are parallel.

The matrix A_NW is obtained for every point of interest, and the NW regression at each point is performed with a different A_NW calculated at that point. A_NW is a function of x, but only its rank-two part changes, and the calculation is simple: we only have to solve the eigenvector problem of a 2×2 matrix for each query point, regardless of the original dimensionality. Note that the bandwidth h is not yet involved in the optimization when we obtain the metric; after the metric is obtained, bandwidth selection can still be applied for even better MSE.

In order to obtain the metric A_NW at every query, we need ∇p(x) and ∇y(x). The true y(x) and p(x) are unknown, so this gradient information must again be obtained from data. Previously, gradient information was obtained locally from a small number of samples [4, 7], but such methods are not preferred here because we need to overcome the corruption of local information in high-dimensional cases.
Instead, we use a global parametric model: fitting a single Gaussian model to all of the data, we estimate the gradients of the true y(x) and p(x) at each point from the global configuration of the data:

p\Big( \begin{pmatrix} y \\ x \end{pmatrix} \Big) = N\Big( \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \Sigma_y & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_x \end{pmatrix} \Big).    (25)

In fact, once the Gaussian parameters are estimated, the target function y(x) = \Sigma_{yx} \Sigma_x^{-1} (x − \mu_x) + \mu_y (see Appendix) is available analytically in closed form, but we reuse y(x) only to obtain gradients for metric learning, and the NW regression then updates y(x) using local information. If the global model is Gaussian, the gradients for metric learning are obtained from the estimated parameters \hat\Sigma_x, \hat\Sigma_{xy}, and \hat\mu_x as \nabla y(x) = \hat\Sigma_x^{-1} \hat\Sigma_{xy} and \nabla p(x)/p(x) = −\hat\Sigma_x^{-1} (x − \hat\mu_x). A pseudo-code of the proposed method is presented in Algorithm 1.

Algorithm 1: Generative Local Metric Learning for NW Regression

Input: data D = {x_i, y_i}_{i=1}^N and a query point x for regression
Output: regression output \hat{y}(x)
Procedure:
1: Estimate the joint covariance matrix \Sigma = \begin{pmatrix} \Sigma_y & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_x \end{pmatrix} and mean vector \mu = \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix} from the data D.
2: Obtain the two eigenvectors

u_1 = \frac{\nabla p(x)}{||\nabla p(x)||} + \frac{\nabla y}{||\nabla y||}    and    u_2 = \frac{\nabla p(x)}{||\nabla p(x)||} − \frac{\nabla y}{||\nabla y||},    (26)

and their corresponding eigenvalues

\lambda_1 = \frac{1}{2p(x)} \big( \nabla y^T \nabla p + ||\nabla y||\,||\nabla p|| \big)    and    \lambda_2 = \frac{1}{2p(x)} \big( \nabla y^T \nabla p − ||\nabla y||\,||\nabla p|| \big),    (27)

using

\nabla p(x) = −p(x)\, \Sigma_x^{-1} (x − \mu_x)    and    \nabla y = \Sigma_x^{-1} \Sigma_{xy}.    (28)

3: Obtain the transform matrix L using u_1, u_2, λ_1, and λ_2:

L = \frac{1}{T} \Big[ \sqrt{\lambda_1 + \gamma}\; \frac{u_1}{||u_1||},\;\; \sqrt{-\lambda_2 + \gamma}\; \frac{u_2}{||u_2||},\;\; \sqrt{\gamma}\; U_o \Big]    (29)

with T = \big( (\lambda_1 + \gamma)(-\lambda_2 + \gamma)\, \gamma^{D-2} \big)^{1/(2D)}, a small constant γ, and an orthonormal matrix U_o ∈ R^{D×(D−2)} spanning the space orthogonal to u_1 and u_2.
4: Perform NW regression at z = L^T x using the transformed data z_i = L^T x_i, i = 1, ..., N.

4.3 Interpretation of the Metric

The learned metric A_NW operates in the two-dimensional subspace spanned by \nabla p(x) = −p(x)\, \Sigma_x^{-1} (x − \mu_x) and \nabla y(x) = \Sigma_x^{-1} \Sigma_{xy}. The two-dimensionality of the metric means that distant points are used along the directions orthogonal to this two-dimensional subspace. This has the effect of virtually increasing the amount of data compared with algorithms using isotropic kernels, particularly in high-dimensional spaces.

The following proposition gives an intuitive explanation of why, once the bandwidth has been selected to balance the leading terms of the bias and variance after the metric change, reducing the bias matters more in high-dimensional spaces than reducing the variance. Proposition 2, Lemma 3, and the following Proposition 4 are obtained without any Gaussian assumption.

Proposition 4: Simplify the MSE as the squared bias obtained from the leading term in Eq. (8) plus the variance², i.e.,

f(h) = h^4 C_1 + \frac{C_2}{N h^D},    (31)

with

C_1 = \Big( \frac{\nabla^T p(x)\, \nabla y(x)}{p(x)} + \frac{\nabla^2 y(x)}{2} \Big)^2    and    C_2 = \frac{1}{(2\sqrt{\pi})^D} \frac{\sigma_y^2(x)}{p(x)}.    (30)

Then, at some h*, f attains the minimum f(h*) = C_1 in the limit of infinite D, where D is the dimensionality of the data.

²See Section 6 of the Appendix.

Proof: Setting \partial f(h)/\partial h = 0 gives the optimal bandwidth

h^* = N^{-\frac{1}{D+4}} \Big( \frac{D\, C_2}{4\, C_1} \Big)^{\frac{1}{D+4}}.    (32)

By plugging h* into f(h) in Eq. (31), we obtain

f(h^*) = N^{-\frac{4}{D+4}} \Big( \Big(\frac{D}{4}\Big)^{\frac{4}{D+4}} + \Big(\frac{4}{D}\Big)^{\frac{D}{D+4}} \Big)\, C_1^{\frac{D}{D+4}} C_2^{\frac{4}{D+4}} \simeq C_1    (for D ≫ 4). □    (33)

Figure 2: (a) Metric calculation for a Gaussian and the gradient ∇y. (b) Empirical MSEs with and without the metric. (c) Leading-order terms of the MSE with the optimal bandwidth for various numbers of data.

In Proposition 4, the first term h^4 C_1 is the squared bias, and the second term C_2/(N h^D) is the derived variance.
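Proposition 4 can also be checked numerically: with illustrative constants (C_1, C_2, and N below are arbitrary positive values, not estimates from data), the minimized f(h) approaches C_1 as the dimensionality grows.

```python
import numpy as np

def f(h, C1, C2, N, D):
    # Simplified MSE of Eq. (31): squared bias + variance.
    return h**4 * C1 + C2 / (N * h**D)

def f_min(C1, C2, N, D):
    # Optimal bandwidth of Eq. (32): h* = N^{-1/(D+4)} (D C2 / (4 C1))^{1/(D+4)}.
    h_star = (D * C2 / (4.0 * C1 * N)) ** (1.0 / (D + 4))
    return f(h_star, C1, C2, N, D)

C1, C2, N = 2.0, 1.0, 1000
ratios = [f_min(C1, C2, N, D) / C1 for D in (2, 10, 50, 200)]
# ratios increase toward 1: in high dimension the minimum MSE is dominated by C1,
# so minimizing the bias term is what matters once h is selected optimally.
```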
The MSE is minimized in a high-dimensional space only through the minimization of the bias, provided it is accompanied by optimization with respect to the bandwidth h. The plot of the MSE in Fig. 2(c) shows that the MSE with bandwidth selection quickly approaches C_1, in particular with a small number of data. The derivation shows that we can ignore the variance optimization with respect to the metric change: we focus only on achieving a small bias, and bandwidth selection, rather than a separate minimization of the variance, follows afterwards.

5 Experiments

The proposed algorithm is evaluated using both synthetic and real datasets. For a Gaussian, Fig. 2(a) depicts the eigenvectors, along with the eigenvalues, of the matrix B = \nabla y\, \nabla^T p + \nabla p\, \nabla^T y at different points in the two-dimensional subspace spanned by ∇y and ∇p. The metric can be compared with the adaptive scaling proposed in [14], which determines the metric according to the average magnitude of ∇y. Our metric also uses ∇y, but it is determined through the relationship with ∇p. Fig. 2(a) shows the metric eigenvalues and eigenvectors at each point for a two-dimensional Gaussian, with a covariance contour shown in the figure. For Gaussian data, the MSE with the proposed metric is shown alongside the MSE with the Euclidean metric in Fig. 2(b); with the metric obtained from the estimated parameters of a jointly Gaussian model, the learned metric yields a large reduction in MSE.

For the real-data experiments, we used the Delve datasets (Abalone, Bank-8fm, Bank-32fh, CPU), UCI datasets (Community, NavalC, NavalT, Protein, Slice), KEEL datasets (Ailerons, Elevators, Puma32h) [1], and datasets from a previous paper (Pendulum, Pol) [15]. The datasets include dozens of features and from several thousand to tens of thousands of samples.
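The synthetic comparison of Fig. 2(b) can be reproduced in spirit with a short script. This is an illustrative reconstruction with parameters of our choosing, reusing the closed-form Gaussian gradients of Eq. (28) and, in simplified form, the transform of Eq. (29) (the score −Σ_x^{-1}(x − μ_x) is used in place of ∇p; this only rescales B and is absorbed by γ):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, h, gamma = 5, 4000, 0.7, 1e-2

# Toy jointly Gaussian data: x ~ N(0, I), y linear in x plus noise.
w = rng.normal(size=D)
X = rng.normal(size=(N, D))
y = X @ w + 0.1 * rng.normal(size=N)
Xq = 0.5 * rng.normal(size=(200, D))   # query points
y_true = Xq @ w                        # E[y|x] for this model

def nw(Xtr, ytr, Q, h):
    """Plain Nadaraya-Watson (Eq. (1)) with an isotropic Gaussian kernel."""
    d2 = ((Q[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * h * h))
    return (W @ ytr) / W.sum(axis=1)

# Global Gaussian fit and the gradients of Eq. (28).
mu = X.mean(0)
Sx = np.cov(X.T)
Sxy = np.cov(np.vstack([X.T, y]))[:D, D]
grad_y = np.linalg.solve(Sx, Sxy)

def transform(xq):
    """Linear map L with A = L L^T and |A| = 1 (cf. Eq. (29))."""
    score = -np.linalg.solve(Sx, xq - mu)     # grad p / p for a Gaussian
    B = np.outer(grad_y, score) + np.outer(score, grad_y)
    lam, U = np.linalg.eigh(B)
    A = (lam[-1] * np.outer(U[:, -1], U[:, -1])
         - lam[0] * np.outer(U[:, 0], U[:, 0]) + gamma * np.eye(D))
    A /= np.linalg.det(A) ** (1.0 / D)
    return np.linalg.cholesky(A)              # A = L L^T

pred_euc = nw(X, y, Xq, h)
pred_met = []
for xq in Xq:
    L = transform(xq)                         # per-query metric
    pred_met.append(nw(X @ L, y, (xq @ L)[None, :], h)[0])
pred_met = np.array(pred_met)

mse_euc = np.mean((pred_euc - y_true) ** 2)
mse_met = np.mean((pred_met - y_true) ** 2)
```

With these settings the isotropic kernel shows the regression-toward-the-mean bias of Eq. (8), while the per-query metric removes most of it, so `mse_met` comes out well below `mse_euc`.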
Using a Gaussian model with regularized maximum-likelihood parameter estimates, we apply the metric that minimizes the bias with a fixed γ = max(|λ_1|, |λ_2|) × 10^{-2}, and we choose h on a pre-chosen validation set. NW estimation with the proposed metric (NW+GMetric) is compared with conventional NW estimation (NW), LLR (LLR), previous metric learning methods for NW regression (NW+WMetric [28], NW+KMetric [14]), the more flexible Gaussian process regression (GPR) with a Gaussian kernel, and the Gaussian globally linear model (GGL) using y(x) = \hat\Sigma_{yx} \hat\Sigma_x^{-1} (x − \hat\mu_x) + \hat\mu_y.

For eleven of the fourteen datasets, NW estimation with the proposed metric statistically achieves one of the best performances. Even when it does not achieve the best performance, the metric always reduces the MSE relative to the original NW estimation. In particular, on the Slice, Pol, CPU, NavalC, and NavalT datasets, GGL performs poorly, showing the non-Gaussianity of the data, while the metric using the same information effectively reduces the MSE

Figure 3: Regression with real-world datasets. NW is the NW regression with conventional kernels, NW+GMetric is the NW regression with the proposed metric, LLR is the locally linear regression, NW+WMetric [28] and NW+KMetric [14] are different metrics for NW regression, GPR is the Gaussian process regression, and GGL is the Gaussian globally linear model. Normalized MSE (NMSE) is the ratio between the MSE and the variance of the target value; constantly predicting the mean of the target gives an NMSE of 1.
A detailed discussion comparing the proposed method with other\nmethods for non-Gaussian data is provided in Section 3 and 4 of the Appendix.\n\n6 Conclusions\n\nAn effective metric function is investigated for reducing the bias of NW regression. Our analysis has\nshown that the bias can be minimized under certain generative assumptions. The optimal metric is\nobtained by solving a series of eigenvector problems of size 2 by 2 and needs no explicit gradients or\ncurvature information.\nThe Gaussian model captures only the rough covariance structure of whole data. The proposed\napproach uses the global covariance to identify the directions that are most likely to have gradient\ncomponents, and the experiments with real data show that the method is effective for more reliable\nand less biased estimation. This is in contrast to LLR which attempts to eliminate the linear noise, but\nthe noise elimination relies on a small number of local data. In contrast, our model uses additional\ninformation from distant data only if they are close in the projected two-dimensional subspace. As a\nresult, the metric allows a more reliable unbiased estimation of the NW estimator.\nWe have also shown that minimizing the variance is relatively unimportant in high-dimensional\nspaces compared to minimizing the bias, especially when the bandwidth selection method is used.\nConsequently, our bias minimization method can achieve suf\ufb01ciently low MSE without the additional\ncomputational cost incurred by empirical MSE minimization.\n\n9\n\n\fAcknowledgments\n\nYKN acknowledges support from NRF/MSIT-2017R1E1A1A03070945, BK21Plus in Korea, MS from KAK-\nENHI 17H01760 in Japan, KEK from IITP/MSIT 2017-0-01778 in Korea, FCP from BK21Plus, MITIP-\n10048320 in Korea, and DDL from the NSF, ONR, ARL, AFOSR, DOT, DARPA in US.\n\nReferences\n\n[1] J. Alcal\u00e1-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. Garc\u00eda, L. S\u00e1nchez, and F. Herrera. 
KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2011.

[2] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.

[3] A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.

[4] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:790–799, 1995.

[5] E. Choi, P. Hall, and V. Rousson. Data sharpening methods for bias reduction in nonparametric regression. Annals of Statistics, 28(5):1339–1355, 2000.

[6] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, pages 209–216, 2007.

[7] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21:32–40, 1975.

[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems 17, pages 513–520, 2005.

[9] P. Hall, S. J. Sheather, M. C. Jones, and J. S. Marron. On optimal data-based bandwidth selection in kernel density estimation. Biometrika, 78:263–269, 1991.

[10] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

[11] S. Haykin. Neural Networks and Learning Machines (3rd Edition). Prentice Hall, 2008.

[12] R. Huang and S. Sun. Kernel regression with sparse metric learning. Journal of Intelligent and Fuzzy Systems, 24(4):775–787, 2013.

[13] P. W. Keller, S.
Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 449–456, 2006.

[14] S. Kpotufe, A. Boularias, T. Schultz, and K. Kim. Gradients weights improve regression and classification. Journal of Machine Learning Research, 17(22):1–34, 2016.

[15] M. Lazaro-Gredilla and A. R. Figueiras-Vidal. Marginalized neural network mixtures for large-scale regression. IEEE Transactions on Neural Networks, 21(8):1345–1351, 2010.

[16] E. A. Nadaraya. On estimating regression. Theory of Probability and its Applications, 9:141–142, 1964.

[17] B. Nguyen, C. Morell, and B. De Baets. Large-scale distance metric learning for k-nearest neighbors regression. Neurocomputing, 214:805–814, 2016.

[18] Y.-K. Noh, B.-T. Zhang, and D. D. Lee. Generative local metric learning for nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1):106–118, 2018.

[19] R. M. Nosofsky and T. J. Palmeri. An exemplar-based random walk model of speeded classification. Psychological Review, 104(2):266–300, 1997.

[20] B. U. Park and J. S. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85:66–72, 1990.

[21] B. U. Park and B. A. Turlach. Practical performance of several data driven bandwidth selectors. Computational Statistics, 7:251–270, 1992.

[22] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065–1076, 1962.

[23] D. Ruppert and M. P. Wand. Multivariate locally weighted least squares regression. The Annals of Statistics, 22(3):1346–1370, 1994.

[24] W. R. Schucany and J. P. Sommers. Improvement of kernel type density estimators.
Journal of the American Statistical Association, 72:420–423, 1977.

[25] L. Shi, T. L. Griffiths, N. H. Feldman, and A. N. Sanborn. Exemplar models as a mechanism for performing Bayesian inference. Psychonomic Bulletin & Review, 17(4):443–464, 2010.

[26] G. S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26:359–372, 1964.

[27] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems 18, pages 1473–1480, 2006.

[28] K. Q. Weinberger and G. Tesauro. Metric learning for kernel regression. In Eleventh International Conference on Artificial Intelligence and Statistics, pages 608–615, 2007.