{"title": "Generalizing from Several Related Classification Tasks to a New Unlabeled Sample", "book": "Advances in Neural Information Processing Systems", "page_first": 2178, "page_last": 2186, "abstract": "We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distribution-free, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented.", "full_text": "Generalizing from Several Related Classification Tasks to a New Unlabeled Sample\n\nGilles Blanchard\nUniversit\u00e4t Potsdam\n\nblanchard@math.uni-potsdam.de\n\nGyemin Lee, Clayton Scott\n\nUniversity of Michigan\n\n{gyemin,clayscot}@umich.edu\n\nAbstract\n\nWe consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distribution-free, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. 
We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented.\n\n1 Introduction\n\nIs it possible to leverage the solution of one classification problem to solve another? This is a question that has received increasing attention in recent years from the machine learning community, and has been studied in a variety of settings, including multi-task learning, covariate shift, and transfer learning. In this work we study a new setting for this question, one that incorporates elements of the three aforementioned settings, and is motivated by many practical applications.\nTo state the problem, let $\\mathcal{X}$ be a feature space and $\\mathcal{Y}$ a space of labels to predict; to simplify the exposition, we will assume the setting of binary classification, $\\mathcal{Y} = \\{-1, 1\\}$, although the methodology and results presented here are valid for general output spaces. For a given distribution $P_{XY}$, we refer to the $X$ marginal distribution $P_X$ as simply the marginal distribution, and the conditional $P_{XY}(Y|X)$ as the posterior distribution.\nThere are $N$ similar but distinct distributions $P^{(i)}_{XY}$ on $\\mathcal{X} \\times \\mathcal{Y}$, $i = 1, \\ldots, N$. For each $i$, there is a training sample $S_i = (X_{ij}, Y_{ij})_{1 \\le j \\le n_i}$ of iid realizations of $P^{(i)}_{XY}$. There is also a test distribution $P^T_{XY}$ that is similar to but again distinct from the \u201ctraining distributions\u201d $P^{(i)}_{XY}$. Finally, there is a test sample $(X^T_j, Y^T_j)_{1 \\le j \\le n_T}$ of iid realizations of $P^T_{XY}$, but in this case the labels $Y^T_j$ are not observed. The goal is to correctly predict these unobserved labels. Essentially, given a random sample from the marginal test distribution $P^T_X$, we would like to predict the corresponding labels. Thus, when we say that the training and test distributions are \u201csimilar,\u201d we mean that there is some pattern making it possible to learn a mapping from marginal distributions to labels. We will refer to this learning problem as learning marginal predictors. A concrete motivating application is given below.\nThis problem may be contrasted with other learning problems. In multi-task learning, only the training distributions are of interest, and the goal is to use the similarity among distributions to improve the training of individual classifiers [1, 2, 3]. In our context, we view these distributions as \u201ctraining tasks,\u201d and seek to generalize to a new distribution/task. In the covariate shift problem, the marginal test distribution is different from the marginal training distribution(s), but the posterior distribution is assumed to be the same [4]. In our case, both marginal and posterior test distributions can differ from their training counterparts [5].\nFinally, in transfer learning, it is typically assumed that at least a few labels are available for the test data, and the training data sets are used to improve the performance of a standard classifier, for example by learning a metric or embedding which is appropriate for all data sets [6, 7]. In our case, no test labels are available, but we hope that through access to multiple training data sets, it is still possible to obtain collective knowledge about the \u201clabeling process\u201d that may be transferred to the test distribution. Some authors have considered transductive transfer learning, which is similar to the problem studied here in that no test labels are available. However, existing work has focused on the case $N = 1$ and typically relies on the covariate shift assumption [8].\nWe propose a distribution-free, kernel-based approach to the problem of learning marginal predictors. Our methodology is shown to yield a consistent learning procedure, meaning that the generalization error tends to the best possible as the sample sizes $N$, $\\{n_i\\}$, $n_T$ tend to infinity. 
We also offer a proof-of-concept experimental study validating the proposed approach on flow cytometry data, including comparisons to multi-task kernels and a simple pooling approach.\n\n2 Motivating Application: Automatic Gating of Flow Cytometry Data\n\nFlow cytometry is a high-throughput measurement platform that is an important clinical tool for the diagnosis of many blood-related pathologies. This technology allows for quantitative analysis of individual cells from a given population, derived for example from a blood sample from a patient. We may think of a flow cytometry data set as a set of $d$-dimensional attribute vectors $(X_j)_{1 \\le j \\le n}$, where $n$ is the number of cells analyzed, and $d$ is the number of attributes recorded per cell. These attributes pertain to various physical and chemical properties of the cell. Thus, a flow cytometry data set is a random sample from a patient-specific distribution.\nNow suppose a pathologist needs to analyze a new (\u201ctest\u201d) patient with data $(X^T_j)_{1 \\le j \\le n_T}$. Before proceeding, the pathologist first needs the data set to be \u201cpurified\u201d so that only cells of a certain type are present. For example, lymphocytes are known to be relevant for the diagnosis of leukemia, whereas non-lymphocytes may potentially confound the analysis. In other words, it is necessary to determine the label $Y^T_j \\in \\{-1, 1\\}$ associated to each cell, where $Y^T_j = 1$ indicates that the $j$-th cell is of the desired type.\nIn clinical practice this is accomplished through a manual process known as \u201cgating.\u201d The data are visualized through a sequence of two-dimensional scatter plots, where at each stage a line segment or polygon is manually drawn to eliminate a portion of the unwanted cells. 
Because of the variability in flow cytometry data, this process is difficult to quantify in terms of a small subset of simple rules. Instead, it requires domain-specific knowledge and iterative refinement. Modern clinical laboratories routinely see dozens of cases per day, so it would be desirable to automate this process.\nSince clinical laboratories maintain historical databases, we can assume access to a number ($N$) of historical patients that have already been expert-gated. Because of biological and technical variations in flow cytometry data, the distributions $P^{(i)}_{XY}$ of the historical patients will vary. For example, Fig. 1 shows exemplary two-dimensional scatter plots for two different patients, where the shaded cells correspond to lymphocytes. Nonetheless, there are certain general trends that are known to hold for all flow cytometry measurements. For example, lymphocytes are known to exhibit low levels of the \u201cside-scatter\u201d (SS) attribute, while expressing high levels of the attribute CD45 (see column 2 of Fig. 1). More generally, virtually every cell type of interest has a known tendency (e.g., high or low) for most measured attributes. Therefore, it is reasonable to assume that there is an underlying distribution (on distributions) governing flow cytometry data sets, that produces roughly similar distributions thereby making possible the automation of the gating process.\n\nFigure 1: Two-dimensional projections of multi-dimensional flow cytometry data. Each row corresponds to a single patient. The distribution of cells differs from patient to patient. Lymphocytes, a type of white blood cell, are marked dark (blue) and others are marked bright (green). These were manually selected by a domain expert.\n\n3 Formal Setting\n\nLet $\\mathcal{X}$ denote the observation space and $\\mathcal{Y} = \\{-1, 1\\}$ the output space. Let $\\mathcal{P}_{\\mathcal{X} \\times \\mathcal{Y}}$ denote the set of probability distributions on $\\mathcal{X} \\times \\mathcal{Y}$, $\\mathcal{P}_{\\mathcal{X}}$ the set of probability distributions on $\\mathcal{X}$, and $\\mathcal{P}_{\\mathcal{Y}|\\mathcal{X}}$ the set of conditional probabilities of $Y$ given $X$ (also known as Markov transition kernels from $\\mathcal{X}$ to $\\mathcal{Y}$) which we also call \u201cposteriors\u201d in this work. The disintegration theorem (see for instance [9], Theorem 6.4) tells us that (under suitable regularity properties, e.g., $\\mathcal{X}$ is a Polish space) any element $P_{XY} \\in \\mathcal{P}_{\\mathcal{X} \\times \\mathcal{Y}}$ can be written as a product $P_{XY} = P_X \\bullet P_{Y|X}$, with $P_X \\in \\mathcal{P}_{\\mathcal{X}}$, $P_{Y|X} \\in \\mathcal{P}_{\\mathcal{Y}|\\mathcal{X}}$. The space $\\mathcal{P}_{\\mathcal{X} \\times \\mathcal{Y}}$ is endowed with the topology of weak convergence and the associated Borel $\\sigma$-algebra.\nIt is assumed that there exists a distribution $\\mu$ on $\\mathcal{P}_{\\mathcal{X} \\times \\mathcal{Y}}$, where $P^{(1)}_{XY}, \\ldots, P^{(N)}_{XY}$ are i.i.d. realizations from $\\mu$, and the sample $S_i$ is made of $n_i$ i.i.d. realizations of $(X, Y)$ following the distribution $P^{(i)}_{XY}$. Now consider a test distribution $P^T_{XY}$ and test sample $S^T = (X^T_j, Y^T_j)_{1 \\le j \\le n_T}$, whose labels are not observed. A decision function is a function $f : \\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X} \\to \\mathbb{R}$ that predicts $\\hat{Y}_i = f(\\hat{P}_X, X_i)$, where $\\hat{P}_X$ is the associated empirical $X$ distribution. If $\\ell : \\mathbb{R} \\times \\mathcal{Y} \\to \\mathbb{R}_+$ is a loss, then the average loss incurred on the test sample is $\\frac{1}{n_T} \\sum_{i=1}^{n_T} \\ell(\\hat{Y}^T_i, Y^T_i)$. Based on this, we define the average generalization error of a decision function over test samples of size $n_T$,\n\n$$\\mathcal{E}(f, n_T) := \\mathbb{E}_{P^T_{XY} \\sim \\mu} \\, \\mathbb{E}_{S^T \\sim (P^T_{XY})^{\\otimes n_T}} \\left[ \\frac{1}{n_T} \\sum_{i=1}^{n_T} \\ell(f(\\hat{P}^T_X, X^T_i), Y^T_i) \\right]. \\qquad (1)$$\n\nAn important point of the analysis is that, at training time as well as at test time, the marginal distribution $P_X$ for a sample is only known through the sample itself, that is, through the empirical marginal $\\hat{P}_X$. As is clear from equation (1), because of this the generalization error also depends on the test sample size $n_T$. 
As $n_T$ grows, $\\hat{P}^T_X$ will converge to $P^T_X$. This motivates the following generalization error when we have an infinite test sample, where we then assume that the true marginal $P^T_X$ is observed:\n\n$$\\mathcal{E}(f, \\infty) := \\mathbb{E}_{P^T_{XY} \\sim \\mu} \\, \\mathbb{E}_{(X^T, Y^T) \\sim P^T_{XY}} \\left[ \\ell(f(P^T_X, X^T), Y^T) \\right]. \\qquad (2)$$\n\nTo gain some insight into this risk, let us decompose $\\mu$ into two parts, $\\mu_X$ which generates the marginal distribution $P_X$, and $\\mu_{Y|X}$ which, conditioned on $P_X$, generates the posterior $P_{Y|X}$. Denote $\\tilde{X} = (P_X, X)$. We then have\n\n$$\\begin{aligned} \\mathcal{E}(f, \\infty) &= \\mathbb{E}_{P_X \\sim \\mu_X} \\mathbb{E}_{P_{Y|X} \\sim \\mu_{Y|X}} \\mathbb{E}_{X \\sim P_X} \\mathbb{E}_{Y|X \\sim P_{Y|X}} \\left[ \\ell(f(\\tilde{X}), Y) \\right] \\\\ &= \\mathbb{E}_{P_X \\sim \\mu_X} \\mathbb{E}_{X \\sim P_X} \\mathbb{E}_{P_{Y|X} \\sim \\mu_{Y|X}} \\mathbb{E}_{Y|X \\sim P_{Y|X}} \\left[ \\ell(f(\\tilde{X}), Y) \\right] \\\\ &= \\mathbb{E}_{(\\tilde{X}, Y) \\sim Q^{\\mu}} \\left[ \\ell(f(\\tilde{X}), Y) \\right]. \\end{aligned}$$\n\nHere $Q^{\\mu}$ is the distribution that generates $\\tilde{X}$ by first drawing $P_X$ according to $\\mu_X$, and then drawing $X$ according to $P_X$. Similarly, $Y$ is generated, conditioned on $\\tilde{X}$, by first drawing $P_{Y|X}$ according to $\\mu_{Y|X}$, and then drawing $Y$ from $P_{Y|X}$. From this last expression, we see that the risk is like a standard binary classification risk based on $(\\tilde{X}, Y) \\sim Q^{\\mu}$. Thus, we can deduce several properties that are known to hold for binary classification risks. For example, if the loss is the 0/1 loss, then $f^*(\\tilde{X}) = 2\\tilde{\\eta}(\\tilde{X}) - 1$ is an optimal predictor, where $\\tilde{\\eta}(\\tilde{X}) = \\mathbb{E}_{Y \\sim Q^{\\mu}_{Y|\\tilde{X}}} \\left[ \\mathbf{1}_{\\{Y = 1\\}} \\right]$. More generally,\n\n$$\\mathcal{E}(f, \\infty) - \\mathcal{E}(f^*, \\infty) = \\mathbb{E}_{\\tilde{X} \\sim Q^{\\mu}_{\\tilde{X}}} \\left[ \\mathbf{1}_{\\{\\mathrm{sign}(f(\\tilde{X})) \\neq \\mathrm{sign}(f^*(\\tilde{X}))\\}} \\, |2\\tilde{\\eta}(\\tilde{X}) - 1| \\right].$$\n\nOur goal is a learning rule that asymptotically predicts as well as the global minimizer of (2), for a general loss $\\ell$. By the above observations, consistency with respect to a general $\\ell$ (thought of as a surrogate) will imply consistency for the 0/1 loss, provided $\\ell$ is classification calibrated [10]. Despite the similarity to standard binary classification in the infinite sample case, we emphasize that the learning task here is different, because the realizations $(\\tilde{X}_{ij}, Y_{ij})$ are neither independent nor identically distributed.\nFinally, we note that there is a condition where for $\\mu$-almost all test distributions $P^T_{XY}$, the classifier $f^*(P^T_X, .)$ (where $f^*$ is the global minimizer of (2)) coincides with the optimal Bayes classifier for $P^T_{XY}$, although no labels from this test distribution are observed. This condition is simply that the posterior $P_{Y|X}$ is ($\\mu$-almost surely) a function of $P_X$. In other words, with the notation introduced above, $\\mu_{Y|X}(P_X)$ is a Dirac delta for $\\mu$-almost all $P_X$. Although we will not be assuming this condition throughout the paper, it is implicitly assumed in the motivating application presented in Section 2, where an expert labels the data points by just looking at their marginal distribution.\nLemma 3.1. For a fixed distribution $P_{XY}$, and a decision function $f : \\mathcal{X} \\to \\mathbb{R}$, let us denote $R(f, P_{XY}) = \\mathbb{E}_{(X,Y) \\sim P_{XY}} [\\ell(f(X), Y)]$ and\n\n$$R^*(P_{XY}) := \\min_{f : \\mathcal{X} \\to \\mathbb{R}} R(f, P_{XY}) = \\min_{f : \\mathcal{X} \\to \\mathbb{R}} \\mathbb{E}_{(X,Y) \\sim P_{XY}} [\\ell(f(X), Y)]$$\n\nthe corresponding optimal (Bayes) risk for the loss function $\\ell$. Assume that $\\mu$ is a distribution on $\\mathcal{P}_{\\mathcal{X} \\times \\mathcal{Y}}$ such that $\\mu$-a.s. 
it holds $P_{Y|X} = F(P_X)$ for some deterministic mapping $F$. Let $f^*$ be a minimizer of the risk (2). Then we have for $\\mu$-almost all $P_{XY}$:\n\n$$R(f^*(P_X, .), P_{XY}) = R^*(P_{XY})$$\n\nand\n\n$$\\mathcal{E}(f^*, \\infty) = \\mathbb{E}_{P_{XY} \\sim \\mu} [R^*(P_{XY})].$$\n\nProof. Straightforward. Obviously for any $f : \\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X} \\to \\mathbb{R}$, one has for all $P_{XY}$: $R(f(P_X, .), P_{XY}) \\ge R^*(P_{XY})$. For any fixed $P_X \\in \\mathcal{P}_{\\mathcal{X}}$, consider $P_{XY} := P_X \\bullet F(P_X)$ and $g^*(P_X)$ a Bayes classifier for this joint distribution. Set $f(P_X, x) := g^*(P_X)(x)$. Then $f$ coincides for $\\mu$-almost all $P_{XY}$ with a Bayes classifier for $P_{XY}$, achieving equality in the above inequality. The second equality follows by taking expectation over $P_{XY} \\sim \\mu$.\n\n4 Learning Algorithm\n\nWe consider an approach based on positive semi-definite kernels, or simply kernels for short. Background information on kernels, including the definition, normalized kernels, universal kernels, and reproducing kernel Hilbert spaces (RKHSs), may be found in [11]. Several well-known learning algorithms, such as support vector machines and kernel ridge regression, may be viewed as minimizers of a norm-regularized empirical risk over the RKHS of a kernel. A similar development also exists for multi-task learning [3]. Inspired by this framework, we consider a general kernel algorithm as follows.\nConsider the loss function $\\ell : \\mathbb{R} \\times \\mathcal{Y} \\to \\mathbb{R}_+$. Let $k$ be a kernel on $\\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X}$, and let $\\mathcal{H}_k$ be the associated RKHS. For the sample $S_i$ let $\\hat{P}^{(i)}_X$ denote the corresponding empirical distribution of the $X_{ij}$s. Also consider the extended input space $\\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X}$ and the extended data $\\tilde{X}_{ij} = (\\hat{P}^{(i)}_X, X_{ij})$. Note that $\\hat{P}^{(i)}_X$ plays a role similar to the task index in multi-task learning. Now define\n\n$$\\hat{f}_{\\lambda} = \\arg\\min_{f \\in \\mathcal{H}_k} \\frac{1}{N} \\sum_{i=1}^{N} \\frac{1}{n_i} \\sum_{j=1}^{n_i} \\ell(f(\\tilde{X}_{ij}), Y_{ij}) + \\lambda \\|f\\|^2. \\qquad (3)$$\n\nFor the hinge loss, by the representer theorem [12] this optimization problem reduces to a quadratic program equivalent to the dual of a kind of cost-sensitive SVM, and therefore can be solved using existing software packages. The final predictor has the form\n\n$$\\hat{f}_{\\lambda}(\\hat{P}_X, x) = \\sum_{i=1}^{N} \\sum_{j=1}^{n_i} \\alpha_{ij} Y_{ij} \\, k((\\hat{P}^{(i)}_X, X_{ij}), (\\hat{P}_X, x))$$\n\nwhere the $\\alpha_{ij}$ are nonnegative and mostly zero. See [11] for details.\nIn the rest of the paper we will consider a kernel $k$ on $\\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X}$ of the product form\n\n$$k((P_1, x_1), (P_2, x_2)) = k_P(P_1, P_2) \\, k_X(x_1, x_2), \\qquad (4)$$\n\nwhere $k_P$ is a kernel on $\\mathcal{P}_{\\mathcal{X}}$ and $k_X$ a kernel on $\\mathcal{X}$. Furthermore, we will consider kernels on $\\mathcal{P}_{\\mathcal{X}}$ of a particular form. Let $k'_X$ denote a kernel on $\\mathcal{X}$ (which might be different from $k_X$) that is measurable and bounded. We define the following mapping $\\Psi : \\mathcal{P}_{\\mathcal{X}} \\to \\mathcal{H}_{k'_X}$:\n\n$$P_X \\mapsto \\Psi(P_X) := \\int_{\\mathcal{X}} k'_X(x, \\cdot) \\, dP_X(x). \\qquad (5)$$\n\nThis mapping has been studied in the framework of \u201ccharacteristic kernels\u201d [13], and it has been proved that there are important links between universality of $k'_X$ and injectivity of $\\Psi$ [14, 15].\nNote that the mapping $\\Psi$ is linear. Therefore, if we consider the kernel $k_P(P_X, P'_X) = \\langle \\Psi(P_X), \\Psi(P'_X) \\rangle$, it is a linear kernel on $\\mathcal{P}_{\\mathcal{X}}$ and cannot be a universal kernel. For this reason, we introduce yet another kernel $K$ on $\\mathcal{H}_{k'_X}$ and consider the kernel on $\\mathcal{P}_{\\mathcal{X}}$ given by\n\n$$k_P(P_X, P'_X) = K(\\Psi(P_X), \\Psi(P'_X)). \\qquad (6)$$\n\nNote that particular kernels inspired by the finite dimensional case are of the form\n\n$$K(v, v') = F(\\|v - v'\\|), \\qquad (7)$$\n\nor\n\n$$K(v, v') = G(\\langle v, v' \\rangle), \\qquad (8)$$\n\nwhere $F, G$ are real functions of a real variable such that they define a kernel. For example, $F(t) = \\exp(-t^2/(2\\sigma^2))$ yields a Gaussian-like kernel, while $G(t) = (1 + t)^d$ yields a polynomial-like kernel. Kernels of the above form on the space of probability distributions over a compact space $\\mathcal{X}$ have been introduced and studied in [16]. Below we apply their results to deduce that $k$ is a universal kernel for certain choices of $k_X$, $k'_X$, and $K$.\n\n5 Learning Theoretic Study\n\nAlthough the regularized estimation formula (3) defining $\\hat{f}_{\\lambda}$ is standard, the generalization error analysis is not, since the $\\tilde{X}_{ij}$ are neither identically distributed nor independent. We begin with a generalization error bound that establishes uniform estimation error control over functions belonging to a ball of $\\mathcal{H}_k$. We then discuss universal kernels, and finally deduce universal consistency of the algorithm. To simplify the analysis somewhat, we assume below that all training samples have the same size $n_i = n$. Also let $\\mathcal{B}_k(r)$ denote the closed ball of radius $r$, centered at the origin, in the RKHS of the kernel $k$. We consider the following assumptions on the loss and kernels:\n(Loss) The loss function $\\ell : \\mathbb{R} \\times \\mathcal{Y} \\to \\mathbb{R}_+$ is $L_{\\ell}$-Lipschitz in its first variable and bounded by $B_{\\ell}$.\n(Kernels-A) The kernels $k_X$, $k'_X$ and $K$ are bounded respectively by constants $B_k^2$, $B_{k'}^2 \\ge 1$, and $B_K^2$. In addition, the canonical feature map $\\Phi_K : \\mathcal{H}_{k'_X} \\to \\mathcal{H}_K$ associated to $K$ satisfies a H\u00f6lder condition of order $\\alpha \\in (0, 1]$ with constant $L_K$, on $\\mathcal{B}_{k'_X}(B_{k'})$:\n\n$$\\forall v, w \\in \\mathcal{B}_{k'_X}(B_{k'}) : \\quad \\|\\Phi_K(v) - \\Phi_K(w)\\| \\le L_K \\|v - w\\|^{\\alpha}. \\qquad (9)$$\n\nSufficient conditions for (9) are described in [11]. As an example, the condition is shown to hold with $\\alpha = 1$ when $K$ is the Gaussian-like kernel on $\\mathcal{H}_{k'_X}$. The boundedness assumptions are also clearly satisfied for Gaussian kernels.\n\nTheorem 5.1 (Uniform estimation error control). Assume conditions (Loss) and (Kernels-A) hold. If $P^{(1)}_{XY}, \\ldots, P^{(N)}_{XY}$ are i.i.d. realizations from $\\mu$, and for each $i = 1, \\ldots, N$, the sample $S_i = (X_{ij}, Y_{ij})_{1 \\le j \\le n}$ is made of i.i.d. realizations from $P^{(i)}_{XY}$, then for any $R > 0$, with probability at least $1 - \\delta$:\n\n$$\\sup_{f \\in \\mathcal{B}_k(R)} \\left| \\frac{1}{N} \\sum_{i=1}^{N} \\frac{1}{n} \\sum_{j=1}^{n} \\ell(f(\\tilde{X}_{ij}), Y_{ij}) - \\mathcal{E}(f, \\infty) \\right| \\le c \\left( R B_k L_{\\ell} \\left( B_{k'} L_K \\left( \\frac{\\log N + \\log \\delta^{-1}}{n} \\right)^{\\alpha/2} + B_K \\frac{1}{\\sqrt{N}} \\right) + B_{\\ell} \\sqrt{\\frac{\\log \\delta^{-1}}{N}} \\right), \\qquad (10)$$\n\nwhere $c$ is a numerical constant, and $\\mathcal{B}_k(R)$ denotes the ball of radius $R$ of $\\mathcal{H}_k$.\n\nProof sketch. The full proofs of this and other results are given in [11]. We give here a brief overview. 
We use the decomposition\n\n$$\\begin{aligned} \\sup_{f \\in \\mathcal{B}_k(R)} \\left| \\frac{1}{N} \\sum_{i=1}^{N} \\frac{1}{n_i} \\sum_{j=1}^{n_i} \\ell(f(\\tilde{X}_{ij}), Y_{ij}) - \\mathcal{E}(f, \\infty) \\right| &\\le \\sup_{f \\in \\mathcal{B}_k(R)} \\left| \\frac{1}{N} \\sum_{i=1}^{N} \\frac{1}{n_i} \\sum_{j=1}^{n_i} \\left( \\ell(f(\\hat{P}^{(i)}_X, X_{ij}), Y_{ij}) - \\ell(f(P^{(i)}_X, X_{ij}), Y_{ij}) \\right) \\right| \\\\ &\\quad + \\sup_{f \\in \\mathcal{B}_k(R)} \\left| \\frac{1}{N} \\sum_{i=1}^{N} \\frac{1}{n_i} \\sum_{j=1}^{n_i} \\ell(f(P^{(i)}_X, X_{ij}), Y_{ij}) - \\mathcal{E}(f, \\infty) \\right| =: (I) + (II). \\end{aligned}$$\n\nBounding (I), using the Lipschitz property of the loss function, can be reduced to controlling\n\n$$\\left\\| f(\\hat{P}^{(i)}_X, .) - f(P^{(i)}_X, .) \\right\\|_{\\infty},$$\n\nconditional to $P^{(i)}_X$, uniformly for $i = 1, \\ldots, N$. This can be obtained using the reproducing property of the kernel $k$, the convergence of $\\Psi(\\hat{P}^{(i)}_X)$ to $\\Psi(P^{(i)}_X)$ as a consequence of Hoeffding\u2019s inequality in a Hilbert space, and the other assumptions (boundedness/H\u00f6lder property) on the kernels.\nConcerning the control of the term (II), it can be decomposed in turn into the convergence conditional to $(P^{(i)}_X)$, and the convergence of the conditional generalization error. In both cases, a standard approach using the Azuma-McDiarmid inequality [17] followed by symmetrization and Rademacher complexity analysis on a kernel space [18, 19] can be applied. For the first part, the random variables are the $(X_{ij}, Y_{ij})$ (which are independent conditional to $(P^{(i)}_X)$); for the second part, the i.i.d. 
variables are the $(P^{(i)}_X)$ (the $(X_{ij}, Y_{ij})$ being integrated out).\nTo establish that $k$ is universal on $\\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X}$, the following lemma is useful.\nLemma 5.2. Let $\\Omega, \\Omega'$ be two compact spaces and $k, k'$ be kernels on $\\Omega, \\Omega'$, respectively. If $k, k'$ are both universal, then the product kernel\n\n$$k((x, x'), (y, y')) := k(x, y) \\, k'(x', y')$$\n\nis universal on $\\Omega \\times \\Omega'$.\n\nSeveral examples of universal kernels are known on Euclidean space. We also need universal kernels on $\\mathcal{P}_{\\mathcal{X}}$. Fortunately, this was recently investigated [16]. Some additional assumptions on the kernels and feature space are required:\n(Kernels-B) $k_X$, $k'_X$, $K$, and $\\mathcal{X}$ satisfy the following: $\\mathcal{X}$ is a compact metric space; $k_X$ is universal on $\\mathcal{X}$; $k'_X$ is continuous and universal on $\\mathcal{X}$; $K$ is universal on any compact subset of $\\mathcal{H}_{k'_X}$.\n\nAdapting the results of [16], we have the following.\nTheorem 5.3 (Universal kernel). Assume condition (Kernels-B) holds. Then, for $k_P$ defined as in (6), the product kernel $k$ in (4) is universal on $\\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X}$. Furthermore, the assumption on $K$ is fulfilled if $K$ is of the form (8), where $G$ is an analytical function with positive Taylor series coefficients, or if $K$ is the normalized kernel associated to such a kernel.\nAs an example, suppose that $\\mathcal{X}$ is a compact subset of $\\mathbb{R}^d$. Let $k_X$ and $k'_X$ be Gaussian kernels on $\\mathcal{X}$. Taking $G(t) = \\exp(t)$, it follows that $K(P_X, P'_X) = \\exp(\\langle \\Psi(P_X), \\Psi(P'_X) \\rangle_{\\mathcal{H}_{k'_X}})$ is universal on $\\mathcal{P}_{\\mathcal{X}}$. By similar reasoning as in the finite dimensional case, the Gaussian-like kernel $K(P_X, P'_X) = \\exp(-\\frac{1}{2\\sigma^2} \\|\\Psi(P_X) - \\Psi(P'_X)\\|^2_{\\mathcal{H}_{k'_X}})$ is also universal on $\\mathcal{P}_{\\mathcal{X}}$. Thus the product kernel is universal.\nCorollary 5.4 (Universal consistency). Assume the conditions (Loss), (Kernels-A), and (Kernels-B) are satisfied. Assume that $N, n$ grow to infinity in such a way that $N = O(n^{\\gamma})$ for some $\\gamma > 0$. Then, if $\\lambda_j$ is a sequence such that $\\lambda_j \\to 0$ and $\\lambda_j \\sqrt{j / \\log j} \\to \\infty$, it holds that\n\n$$\\mathcal{E}(\\hat{f}_{\\lambda_{\\min(N, n^{\\alpha})}}, \\infty) \\to \\inf_{f : \\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X} \\to \\mathbb{R}} \\mathcal{E}(f, \\infty)$$\n\nin probability.\n\n6 Experiments\n\nWe demonstrate the proposed methodology for flow cytometry data auto-gating, described above. Peripheral blood samples were obtained from 35 normal patients, and lymphocytes were classified by a domain expert. The corresponding flow cytometry data sets have sample sizes ranging from 10,000 to 100,000, and the proportion of lymphocytes in each data set ranges from 10 to 40%. We took $N = 10$ of these data sets for training, and the remaining 25 for testing. To speed training time, we subsampled the 10 training data sets to have 1000 data points (cells) each. Adopting the hinge loss, we used the SVMlight [20] package to solve the quadratic program characterizing the solution.\nThe kernels $k_X$, $k'_X$, and $K$ are all taken to be Gaussian kernels with respective bandwidths $\\sigma_X$, $\\sigma'_X$, and $\\sigma$. We set $\\sigma_X$ such that $\\sigma_X^2$ equals 10 times the average distance of a data point to its nearest neighbor within the same data set. The second bandwidth was defined similarly, while the third was set to 1. The regularization parameter $\\lambda$ was set to 1.\nFor comparison, we also considered three other options for $k_P$. These kernels have the form $k_P(P_1, P_2) = 1$ if $P_1 = P_2$, and $k_P(P_1, P_2) = \\tau$ otherwise. When $\\tau = 1$, the method is equivalent to pooling all of the training data together in one data set, and learning a single SVM classifier. This idea has been previously studied in the context of flow cytometry by [21]. When $0 < \\tau < 1$, we obtain a kernel like what was used for multi-task learning (MTL) by [3]. Note that these kernels have the property that if $P_1$ is a training data set, and $P_2$ a test data set, then $P_1 \\neq P_2$ and so $k_P(P_1, P_2)$ is simply a constant. This implies that the learning rules produced by these kernels do not adapt to the test distribution, unlike the proposed kernel. In the experiments, we take $\\tau = 1$ (pooling), 0.01, and 0.5 (MTL).\nThe results are shown in Fig. 2 and summarized in Table 1. The middle column of the table reports the average misclassification rate on the training data sets. Here we used those data points that were not part of the 1000-element subsample used for training. The right column shows the average misclassification rate on the test data sets.\n\nTable 1: The misclassification rates (%) on training data sets and test data sets for different $k_P$. The proposed method adapts the decision function to the test data (through the marginal-dependent kernel), accounting for its improved performance.\n\nkP                  Train   Test\nPooling (\u03c4 = 1)     1.41    2.32\nMTL (\u03c4 = 0.01)      1.59    2.64\nMTL (\u03c4 = 0.5)       1.34    2.36\nProposed            1.32    2.29\n\nFigure 2: The misclassification rates (%) on training data sets and test data sets for different $k_P$. The last 25 data sets separated by dotted line are not used during training.\n\n7 Discussion\n\nOur approach to learning marginal predictors relies on the extended input pattern $\\tilde{X} = (P_X, X)$. Thus, we study the natural algorithm of minimizing a regularized empirical loss over a reproducing kernel Hilbert space associated with the extended input domain $\\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X}$. We also establish universal consistency, using a novel generalization error analysis under the inherent non-iid sampling plan, and a construction of a universal kernel on $\\mathcal{P}_{\\mathcal{X}} \\times \\mathcal{X}$. For the hinge loss, the algorithm may be implemented using standard techniques for SVMs. 
The algorithm is applied to flow cytometry auto-gating, and shown to improve upon kernels that do not adapt to the test distribution.\nSeveral future directions exist. From an application perspective, the need for adaptive classifiers arises in many applications, especially in biomedical applications involving biological and/or technical variation in patient data. For example, when electrocardiograms are used to monitor cardiac patients, it is desirable to classify each heartbeat as irregular or not. However, irregularities in a test patient\u2019s heartbeat will differ from irregularities of historical patients, hence the need to adapt to the test distribution [22].\nWe can also ask how the methodology and analysis can be extended to the context where a small number of labels are available for the test distribution, as is commonly assumed in transfer learning. In this setting, two approaches are possible. The simplest one is to use the same optimization problem (3), wherein we include additionally the labeled examples of the test distribution. However, if several test samples are to be treated in succession, and we want to avoid a full, resource-consuming re-training using all the training samples each time, an interesting alternative is the following: learn once a function $f_0(P_X, x)$ using the available training samples via (3); then, given a partially labeled test sample, learn a decision function on this sample only via the usual kernel norm regularized empirical loss minimization method, but replace the usual regularizer term $\\|f\\|^2$ by $\\|f - f_0(P_X, .)\\|^2$ (note that $f_0(P_X, .) \\in \\mathcal{H}_k$). In this sense, the marginal-adaptive decision function learned from the training samples would serve as a \u201cprior\u201d for learning on the test data.\nIt would also be of interest to extend the proposed methodology to a multi-class setting. 
In this case,\nthe problem has an interesting interpretation in terms of \u201clearning to cluster.\u201d Each training task may\nbe viewed as a data set that has been clustered by a teacher. Generalization then entails the ability\nto learn the clustering process, so that clusters may be assigned to a new unlabeled data set.\nFuture work may consider other asymptotic regimes, e.g., where {ni}, nT do not tend to in\ufb01nity,\nor they tend to in\ufb01nity much slower than N. It may also be of interest to develop implementations\nfor differentiable losses such as the logistic loss, allowing for estimation of posterior probabilities.\nFinally, we would like to specify conditions on \u00b5, the distribution-generating distribution, that are\nfavorable for generalization (beyond the simple condition discussed in Lemma 3.1).\n\nAcknowledgments\n\nG. Blanchard was supported by the European Community\u2019s 7th Framework Programme under\nthe PASCAL2 Network of Excellence (ICT-216886) and under the E.U. grant agreement 247022\n(MASH Project). G. Lee and C. Scott were supported in part by NSF Grant No. 0953135.\n\n8\n\n\fReferences\n[1] S. Thrun, \u201cIs learning the n-th thing any easier than learning the \ufb01rst?,\u201d Advances in Neural\n\nInformation Processing Systems, pp. 640\u2013646, 1996.\n\n[2] R. Caruana, \u201cMultitask learning,\u201d Machine Learning, vol. 28, pp. 41\u201375, 1997.\n[3] T. Evgeniou and M. Pontil, \u201cLearning multiple tasks with kernel methods,\u201d J. Machine Learn-\n\ning Research, pp. 615\u2013637, 2005.\n\n[4] S. Bickel, M. Br\u00a8uckner, and T. Scheffer, \u201cDiscriminative learning under covariate shift,\u201d J.\n\nMachine Learning Research, pp. 2137\u20132155, 2009.\n\n[5] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in\n\nMachine Learning, The MIT Press, 2009.\n\n[6] R. K. Ando and T. 
Zhang, \u201cA high-performance semi-supervised learning method for text chunking,\u201d Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL 05), pp. 1\u20139, 2005.\n[7] A. Rettinger, M. Zinkevich, and M. Bowling, \u201cBoosting expert ensembles for rapid concept recall,\u201d Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 06), vol. 1, pp. 464\u2013469, 2006.\n[8] A. Arnold, R. Nallapati, and W. W. Cohen, \u201cA comparative study of methods for transductive transfer learning,\u201d Seventh IEEE International Conference on Data Mining Workshops, pp. 77\u201382, 2007.\n[9] O. Kallenberg, Foundations of Modern Probability, Springer, 2002.\n[10] P. Bartlett, M. Jordan, and J. McAuliffe, \u201cConvexity, classification, and risk bounds,\u201d J. Amer. Stat. Assoc., vol. 101, no. 473, pp. 138\u2013156, 2006.\n[11] G. Blanchard, G. Lee, and C. Scott, \u201cSupplemental material,\u201d NIPS 2011.\n[12] I. Steinwart and A. Christmann, Support Vector Machines, Springer, 2008.\n[13] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola, \u201cA kernel approach to comparing distributions,\u201d in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, R. Holte and A. Howe, Eds., 2007, pp. 1637\u20131641.\n[14] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola, \u201cA kernel method for the two-sample-problem,\u201d in Advances in Neural Information Processing Systems 19, B. Sch\u00f6lkopf, J. Platt, and T. Hoffman, Eds., 2007, pp. 513\u2013520.\n[15] B. Sriperumbudur, A. Gretton, K. Fukumizu, B. Sch\u00f6lkopf, and G. Lanckriet, \u201cHilbert space embeddings and metrics on probability measures,\u201d Journal of Machine Learning Research, vol. 11, pp. 1517\u20131561, 2010.\n[16] A. Christmann and I. 
Steinwart, \u201cUniversal kernels on non-standard input spaces,\u201d in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 406\u2013414.\n[17] C. McDiarmid, \u201cOn the method of bounded differences,\u201d Surveys in Combinatorics, vol. 141, pp. 148\u2013188, 1989.\n[18] V. Koltchinskii, \u201cRademacher penalties and structural risk minimization,\u201d IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1902\u20131914, 2001.\n[19] P. Bartlett and S. Mendelson, \u201cRademacher and Gaussian complexities: Risk bounds and structural results,\u201d Journal of Machine Learning Research, vol. 3, pp. 463\u2013482, 2002.\n[20] T. Joachims, \u201cMaking large-scale SVM learning practical,\u201d in Advances in Kernel Methods - Support Vector Learning, B. Sch\u00f6lkopf, C. Burges, and A. Smola, Eds., chapter 11, pp. 169\u2013184. MIT Press, Cambridge, MA, 1999.\n[21] J. Toedling, P. Rhein, R. Ratei, L. Karawajew, and R. Spang, \u201cAutomated in-silico detection of cell populations in flow cytometry readouts and its application to leukemia disease monitoring,\u201d BMC Bioinformatics, vol. 7, pp. 282, 2006.\n[22] J. Wiens, Machine Learning for Patient-Adaptive Ectopic Beat Classification, Master\u2019s Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2010.", "award": [], "sourceid": 1205, "authors": [{"given_name": "Gilles", "family_name": "Blanchard", "institution": null}, {"given_name": "Gyemin", "family_name": "Lee", "institution": null}, {"given_name": "Clayton", "family_name": "Scott", "institution": null}]}