{"title": "AUC optimization and the two-sample problem", "book": "Advances in Neural Information Processing Systems", "page_first": 360, "page_last": 368, "abstract": "The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the one-dimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method.", "full_text": "AUC optimization and the two-sample problem\n\nSt\u00e9phan Cl\u00e9men\u00e7on\n\nTelecom Paristech (TSI) - LTCI UMR Institut Telecom/CNRS 5141\n\nstephan.clemencon@telecom-paristech.fr\n\nMarine Depecker\n\nTelecom Paristech (TSI) - LTCI UMR Institut Telecom/CNRS 5141\n\nmarine.depecker@telecom-paristech.fr\n\nNicolas Vayatis\n\nENS Cachan & UniverSud - CMLA UMR CNRS 8536\n\nnicolas.vayatis@cmla.ens-cachan.fr\n\nAbstract\n\nThe purpose of the paper is to explore the connection between multivariate homogeneity\ntests and AUC optimization. The latter problem has recently received\nmuch attention in the statistical learning literature. 
From the elementary observa-\ntion that, in the two-sample problem setup, the null assumption corresponds to the\nsituation where the area under the optimal ROC curve is equal to 1/2, we pro-\npose a two-stage testing method based on data splitting. A nearly optimal scoring\nfunction in the AUC sense is \ufb01rst learnt from one of the two half-samples. Data\nfrom the remaining half-sample are then projected onto the real line and eventu-\nally ranked according to the scoring function computed at the \ufb01rst stage. The last\nstep amounts to performing a standard Mann-Whitney Wilcoxon test in the one-\ndimensional framework. We show that the learning step of the procedure does\nnot affect the consistency of the test as well as its properties in terms of power,\nprovided the ranking produced is accurate enough in the AUC sense. The results\nof a numerical experiment are eventually displayed in order to show the ef\ufb01ciency\nof the method.\n\n1 Introduction\n\nThe statistical problem of testing homogeneity of two samples arises in a wide variety of appli-\ncations, ranging from bioinformatics to psychometrics through database attribute matching for in-\nstance. Practitioners may rely upon a wide range of nonparametric tests for detecting differences in\ndistribution (or location) between two one-dimensional samples, among which tests based on lin-\near rank statistics, such as the celebrated Mann-Whitney Wilcoxon test. Being a (locally) optimal\nprocedure, the latter is the most widely used in homogeneity testing. Such rank statistics were orig-\ninally introduced because they are distribution-free under the null hypothesis, thus permitting to set\ncritical values in a non asymptotic fashion for any given level. Beyond this simple fact, the cru-\ncial advantage of rank-based tests relies in their asymptotic ef\ufb01ciency in a variety of nonparametric\nsituations. 
We refer for instance to [15] for an account of asymptotically (locally) uniformly most\npowerful tests and a comprehensive treatment of asymptotic optimality of R-statistics.\nIn a different context, consider data sampled from a feature space X \u2282 Rd of high dimension with\nbinary label information in {\u22121, +1}. The problem of ranking such data, also known as the bipartite\nranking problem, has recently gained an increasing attention in the machine-learning literature, see\n\n1\n\n\f[5, 10, 19]. Here, the goal is to learn, based on a pooled set of labeled examples, how to rank\nnovel data with unknown labels, by means of a scoring function s : X \u2192 R, in order that positive\nones appear on top of the list. Over the last few years, this global learning problem has been the\nsubject of intensive research, involving issues related to the design of appropriate criteria re\ufb02ecting\nranking performance or valid extensions of the Empirical Risk Minimization approach (ERM) to\nthis framework [2, 6, 11]. In most applications, the gold standard for measuring the capacity of a\nscoring function s to discriminate between the class populations however remains the area under\nthe ROC curve criterion (AUC) and most ranking/scoring methods boil down to maximizing its\nempirical counterpart. The empirical AUC may be viewed as the Mann-Whitney statistic based on\nthe images of the multivariate samples by s, see [13, 9, 12, 18].\nThe purpose of this paper is to investigate how ranking methods for multivariate data with binary\nlabels may be exploited in order to extend the rank-based test approach for testing homogeneity\nbetween two samples to a multidimensional setting. Precisely, the testing principle promoted in this\npaper is described through an extension of the Mann-Whitney Wilcoxon test, based on a preliminary\nranking of the data through empirical AUC maximization. 
The consistency of the test is proved to\nhold, as soon as the learning procedure is consistent in the AUC sense and its capacity to detect\n\u201csmall\u201d deviations from the homogeneity assumption is illustrated by a simulation example.\nThe rest of the paper is organized as follows. In Section 2, the homogeneity testing problem is\nformulated and standard approaches are recalled, with focus on the one-dimensional case. Section\n3 highlights the connection of the two-sample problem with optimal ROC curves and gives some\ninsight into our approach. In Section 4, we describe the testing procedure proposed and set preliminary\ngrounds for its theoretical validity. Simulation results are presented in Section 5 and technical\ndetails are deferred to the Appendix.\n\n2 The two-sample problem\n\nWe start off by setting out the notations needed throughout the paper and formulate the two-sample\nproblem precisely. We recall standard approaches to homogeneity testing. In particular, special\nattention is paid to the one-dimensional case, for which two-sample linear rank statistics allow for\nconstructing locally optimal tests in a variety of situations.\nProbabilistic setup. The problem considered in this paper is to test the hypothesis that two independent\ni.i.d. random samples, valued in $\\mathbb{R}^d$ with $d \\geq 1$, $X_1^+, \\ldots, X_n^+$ and $X_1^-, \\ldots, X_m^-$, are\nidentical in distribution. We denote by G(dx) the distribution function of the $X_i^+$\u2019s, while the one\nof the $X_j^-$\u2019s is denoted by H(dx). We also denote by P(G,H) the probability distribution on the\nunderlying space. The testing problem is tackled here from a nonparametric perspective, meaning\nthat the distributions G(dx) and H(dx) are assumed to be unknown. We suppose in addition that\nG(dx) and H(dx) are continuous distributions and the asymptotics are described as follows: we set\nN = m + n and suppose that n/N \u2192 p \u2208 (0, 1) as n, m tend to infinity. 
Formally, the problem\nis to test the null hypothesis H0 : G = H against the alternative H1 : G \u2260 H, based on the two\ndata sets. In this paper, we place ourselves in the difficult case where G and H have same support,\nX \u2282 Rd say.\nMeasuring dissimilarity. A possible approach is to consider a probability (pseudo)-metric D on the\nspace of probability distributions on Rd. Based on the simple observation that D(G, H) = 0 under\nthe null hypothesis, possible testing procedures consist of computing estimates $\\widehat{G}_n$ and $\\widehat{H}_m$ of the\nunderlying distributions and rejecting H0 for \u201clarge\u201d values of the statistic $D(\\widehat{G}_n, \\widehat{H}_m)$, see [3] for\ninstance. Beyond computational difficulties and the necessity of identifying a proper standardization\nin order to make the statistic asymptotically pivotal (i.e. its limit distribution is parameter free), the\nmajor issue one faces when trying to implement such plug-in procedures is related to the curse of\ndimensionality. Indeed, plug-in procedures involve the consistent estimation of distributions on a\nfeature space of possibly very large dimension d \u2208 N\u2217.\nVarious metrics or pseudo-metrics can be considered for measuring dissimilarity between two probability\ndistributions. We refer to [17] for an excellent account of metrics in spaces of probability measures\nand their applications. Typical examples include the chi-square distance, the Kullback-Leibler\ndivergence, the Hellinger distance, the Kolmogorov-Smirnov distance and its generalizations of the\nfollowing type\n\n$$\\mathrm{MMD}(G, H) = \\sup_{f \\in \\mathcal{F}} \\left| \\int_{x \\in \\mathcal{X}} f(x)\\, G(dx) - \\int_{x \\in \\mathcal{X}} f(x)\\, H(dx) \\right|, \\qquad (1)$$\n\nwhere F denotes a supposedly rich enough class of functions f : X \u2282 Rd \u2192 R, so that\nMMD(G, H) = 0 if and only if G = H. 
The quantity (1) is called the Maximum Mean Discrepancy\nin [1], where a unit ball of a reproducing kernel Hilbert space H is chosen for F in order\nto allow for efficient computation of the supremum (1), see also [23]. The view promoted in the\npresent paper for the two-sample problem is very different in nature and is inspired by traditional\nprocedures in the particular one-dimensional case.\nThe one-dimensional case. A classical approach to the two-sample problem in the one-dimensional\nsetup lies in ordering the observed data using the natural order on the real line R and then basing the\ndecision on the ranks of the positive instances among the pooled sample:\n\n$$\\forall i \\in \\{1, \\ldots, n\\}, \\quad R_i = N F_{n,m}(X_i^+),$$\n\nwhere $F_{n,m}(t) = (n/N)\\widehat{G}_n(t) + (m/N)\\widehat{H}_m(t)$, and denoting by $\\widehat{G}_n(t) = n^{-1} \\sum_{i \\leq n} I\\{X_i^+ \\leq t\\}$\nand $\\widehat{H}_m(t) = m^{-1} \\sum_{j \\leq m} I\\{X_j^- \\leq t\\}$ the empirical counterparts of the cumulative distribution\nfunctions G and H respectively. This approach is grounded in invariance considerations, practical\nsimplicity and optimality of tests based on R-estimates for this problem, depending on the class\nof alternative hypotheses considered. Assuming the distributions G and H continuous, the idea\nunderlying such tests lies in the simple fact that, under the null hypothesis, the ranks of positive\ninstances are uniformly distributed over {1, . . . , N}. A popular choice is to consider the sum of\n\u201cpositive ranks\u201d, leading to the well-known rank-sum Wilcoxon statistic [22]\n\n$$\\widehat{W}_{n,m} = \\sum_{i=1}^{n} R_i,$$\n\nwhich is distribution-free under H0, see Section 6.9 in [15] for further details. We also recall that\nthe validity framework of the rank-sum test classically extends to the case where some observations\nare tied (i.e. when G and/or H may be degenerate at some points), by assigning the mean rank to ties\n[4]. 
We shall denote by $W_{n,m}$ the distribution of the (average rank version of the) Wilcoxon statistic\n$\\widehat{W}_{n,m}$ under the homogeneity hypothesis. Since tables for the distributions $W_{n,m}$ are available, no\nasymptotic approximation result is thus needed for building a test of appropriate level. As will be\nrecalled below, the test based on the R-statistic $\\widehat{W}_{n,m}$ has appealing optimality properties for certain\nclasses of alternatives. Although R-estimates (i.e. functions of the $R_i$\u2019s) form a very rich collection\nof statistics, for lack of space, we restrict our attention to the two-sample Wilcoxon statistic in\nthis paper.\nHeuristics. We may now give a first insight into the way we shall tackle the problem in the multidimensional\ncase. Suppose that we are able to \u201cproject\u201d the multivariate sampling data onto the real\nline through a certain scoring function s : Rd \u2192 R in order to preserve the possible dissimilarity\n(considered in a certain specific sense, which we shall discuss below) between the two populations,\nleading then to \u201clarge\u201d values of the score s(x) for the positive instances and \u201csmall\u201d values for the\nnegative ones with high probability. Now that the dimension of the problem has been brought down\nto 1, observations can be ranked and one may perform for instance a basic two-sample Wilcoxon\ntest based on the data sets $s(X_1^+), \\ldots, s(X_n^+)$ and $s(X_1^-), \\ldots, s(X_m^-)$.\n\nRemark 1 (LEARNING A STUDENT t TEST.) We point out that this is precisely the task Linear\nDiscriminant Analysis (LDA) tries to perform, in a restrictive Gaussian framework however (namely, when\nG and H are normal distributions with the same covariance structure). In order to test deviations\nfrom the homogeneity hypothesis on the basis of the original samples, one may consider applying a\nunivariate Student t test based on the \u201cprojected\u201d data $\\{\\widehat{\\delta}(X_i^+) : 1 \\leq i \\leq n\\}$ and $\\{\\widehat{\\delta}(X_i^-) : 1 \\leq i \\leq m\\}$,\nwhere $\\widehat{\\delta}$ denotes the empirical discriminant function; this may be seen as an appealing\nalternative to multivariate extensions of the standard t test [14].\n\nThe goal of this paper is to show how to exploit recent advances in ROC/AUC optimization for\nextending this heuristic to more general situations than the parametric one mentioned above.\n\n3 Connections with bipartite ranking\n\nROC curves are among the most widely used graphical tools for visualizing the dissimilarity between\ntwo one-dimensional distributions in a large variety of applications such as anomaly detection\nin signal analysis, medical diagnosis, information retrieval, etc. As this concept is at the heart of\nthe ranking issue in the binary setting, which forms the first stage of the testing procedure sketched\nabove, we recall its definition precisely.\nDefinition 1 (ROC curve) Let g and h be two cumulative distribution functions on R. The ROC\ncurve related to the distributions g(dt) and h(dt) is the graph of the mapping:\n\n$$\\mathrm{ROC}((g, h), \\cdot) : \\alpha \\in [0, 1] \\mapsto 1 - g \\circ h^{-1}(1 - \\alpha),$$\n\ndenoting by $f^{-1}(u) = \\inf\\{t \\in \\mathbb{R} : f(t) \\geq u\\}$ the generalized inverse of any c\u00e0d-l\u00e0g function\nf : R \u2192 R. When the distributions g(dt) and h(dt) are continuous, it can alternatively be defined as\nthe parametric curve t \u2208 R \u21a6 (1 \u2212 h(t), 1 \u2212 g(t)).\nOne may show that ROC ((g, h), \u00b7) is above the diagonal \u2206 : \u03b1 \u2208 [0, 1] \u21a6 \u03b1 of the ROC space if\nand only if the distribution g is stochastically larger than h and it is concave as soon as the likelihood\nratio dg/dh is increasing. 
When g(dt) and h(dt) are both continuous, the curves ROC((g, h), .) and\nROC((h, g), .) are symmetric with respect to the diagonal of the ROC space with slope equal to one.\nRefer to [9] for a detailed list of properties of ROC curves.\nThe notion of ROC curve provides a functional measure of dissimilarity between distributions on\nR: the closer to the corners of the unit square the curve ROC ((g, h), \u00b7) is, the more dissimilar the\ndistributions g and h are. For instance, it exactly coincides with the upper left-hand corner of the unit\nsquare, namely the curve \u03b1 \u2208 [0, 1] \u21a6 I{\u03b1 \u2208 ]0, 1]}, when there exists l \u2208 R such that the support\nof distribution g(dt) is a subset of [l, \u221e[, while ]\u2212\u221e, l] contains the support of h. In contrast, it\nmerges with the diagonal \u2206 when g = h. Hence, the distance of ROC ((g, h), \u00b7) to the diagonal may\nbe naturally used to quantify departure from the homogeneous situation. The L1-norm provides a\nconvenient way of measuring such a distance, leading to the classical AUC criterion (AUC standing\nfor area under the ROC curve):\n\n$$\\mathrm{AUC}(g, h) = \\int_{\\alpha=0}^{1} \\mathrm{ROC}((g, h), \\alpha)\\, d\\alpha.$$\n\nThe popularity of this summary quantity arises from the fact that it can be interpreted in a probabilistic\nfashion, and may be viewed as a distance between the locations of the two distributions. In\nthis respect, we recall the following result.\nProposition 1 Let g and h be two distributions on R. We have:\n\n$$\\mathrm{AUC}(g, h) = P\\{Z > Z'\\} + \\frac{1}{2} P\\{Z = Z'\\} = \\frac{1}{2} + \\frac{1}{2} \\left( E[h(Z)] - E[g(Z')] \\right),$$\n\nwhere Z and Z' denote independent random variables, drawn from g(dt) and h(dt) respectively.\n\nWe recall that the homogeneous situation corresponds to the case where AUC(g, h) = 1/2 and the\nMann-Whitney statistic [16]\n\n$$U_{n,m} = \\frac{1}{nm} \\sum_{i=1}^{n} \\sum_{j=1}^{m} \\left( I\\{X_j^- < X_i^+\\} + \\frac{1}{2} I\\{X_j^- = X_i^+\\} \\right)$$\n\nis exactly the empirical counterpart of AUC(g, h). It yields exactly the same statistical decisions as\nthe two-sample Wilcoxon statistic, insofar as they are related as follows:\n\n$$\\widehat{W}_{n,m} = nm\\, U_{n,m} + n(n + 1)/2.$$\n\nFor this reason, the related test of hypotheses is called Mann-Whitney Wilcoxon test (MWW).\nMultidimensional extension. In the multivariate setup, the notion of ROC curve can be extended\nin the following way. Let H(dx) and G(dx) be two given distributions on Rd and S = {s : X \u2192 R |\ns Borel measurable}. For any scoring function s \u2208 S, we denote by Hs(dt) and Gs(dt) the images\nof H(dx) and G(dx) by the mapping s(x). In addition, we set for all s \u2208 S:\n\nROC(s, .) = ROC((Gs, Hs), .) and AUC(s) = AUC(Gs, Hs).\n\nClearly, the families of univariate distributions {Gs}s\u2208S and {Hs}s\u2208S entirely characterize the\nmultivariate probability measures G and H. One may thus consider evaluating the dissimilarity\nbetween H(dx) and G(dx) on Rd through the family of curves {ROC(s, .)}s\u2208S or through the\ncollection of scalar values {AUC(s)}s\u2208S. 
Going back to the homogeneity testing problem, the null\nassumption may be reformulated as\n\n\u201dH0 : \u2200s \u2208 S, AUC(s) = 1/2\u201d versus \u201dH1 : \u2203s \u2208 S such that AUC(s) > 1/2\u201d.\n\nThe next result, following from standard Neyman-Pearson type arguments, shows that the supremum\nsups\u2208S AUC(s) is attained by increasing transforms of the likelihood ratio \u03c6(x) = dG/dH(x),\nx \u2208 X . Scoring functions with largest AUC are natural candidates for detecting the alternative H1.\nTheorem 1 (OPTIMAL ROC CURVE.) The set of S\u2217 = {T \u25e6 \u03c6 | T : R \u2192 R strictly increasing }\nde\ufb01nes the collection of optimal scoring functions in the sense that: \u2200s \u2208 S,\n\n\u2200\u03b1 \u2208 [0, 1], ROC(s, \u03b1) \u2264 ROC\u2217(\u03b1) and AUC(s) \u2264 AUC\u2217,\nwith the notations ROC\u2217(.) = ROC(s\u2217, .) and AUC\u2217 = AUC(s\u2217) for s\u2217 \u2208 S\u2217.\n\nRefer to Proposition 4\u2019s proof in [9] for a detailed argument. Notice that, as dG/dH(X) =\ndG\u03c6(X)/dH\u03c6(\u03c6(X)), replacing X by s\u2217(X) with s\u2217 \u2208 S\u2217 leaves the optimal ROC curve un-\ntouched. 
The following corollary is straightforward.\nCorollary 1 For any s\u2217 \u2208 S\u2217, we have: sups\u2208S |AUC(s) \u2212 1/2| = AUC(s\u2217) \u2212 1/2.\nConsequently, the homogeneity testing problem may be seen as closely related to the problem of\nestimating the optimal AUC\u2217, since it may be re-formulated as follows:\n\n\u201cH0 : AUC\u2217 = 1/2\u201d versus \u201cH1 : AUC\u2217 > 1/2\u201d.\n\nKnowing how a single optimal scoring function s\u2217 \u2208 S\u2217 ranks observations drawn from a mixture\nof G and H is sufficient for detecting departure from the homogeneity hypothesis in an optimal fashion,\nthe MWW statistic computed from the $(s^*(X_i^+), s^*(X_j^-))$\u2019s being an asymptotically efficient\nestimate of AUC\u2217 and thus yielding an asymptotically (locally) uniformly most powerful test.\nLet F (dx) = pG(dx) + (1 \u2212 p)H(dx) and denote by Fs(dt) the image of the distribution F by\ns \u2208 S. Notice that, for any s\u2217 \u2208 S\u2217, the scoring function S\u2217 = Fs\u2217 \u25e6 s\u2217 is still optimal and the\nscore variable S\u2217(X) is uniformly distributed on [0, 1] under the mixture distribution F (in addition,\nit may be easily shown to be independent of s\u2217 \u2208 S\u2217). Observe in addition that AUC\u2217 \u2212 1/2\nmay be viewed as the Earth Mover\u2019s distance between the class distributions HS\u2217 and GS\u2217 for this\n\u201cnormalization\u201d:\n\n$$\\mathrm{AUC}^* - 1/2 = \\int_{t=0}^{1} \\{H_{S^*}(t) - G_{S^*}(t)\\}\\, dt.$$\n\nEmpirical AUC maximization. A natural way of inferring the value of AUC\u2217 and/or selecting\na scoring function $\\hat{s}$ with AUC nearly as large as AUC\u2217 is to maximize an empirical version of\nthe AUC criterion over a set S0 of scoring function candidates. 
We assume that the class S0 is\nsufficiently rich in order to guarantee that the bias AUC\u2217 \u2212 sups\u2208S0 AUC(s) is small, and its complexity\nis controlled (when measured for instance by the VC dimension of the collection of sets\n{{x \u2208 X : s(x) \u2265 t}, (s, t) \u2208 S0 \u00d7 R} as in [7] or by the order of magnitude of conditional\nRademacher averages as in [6]). We recall that, under such assumptions, universal consistency\nresults have been established for empirical AUC maximizers, together with distribution-free generalization\nbounds, see [2, 6] for instance. We point out that this approach can be extended to other\nrelevant ranking criteria. The contours of a theory guaranteeing the statistical performance of the\nERM approach for empirical risk functionals defined by R-estimates have been sketched in [8].\n\n4 The two-stage testing procedure\n\nAssume that data have been split into two subsamples: the first data set\n$D_{n_0,m_0} = \\{X_1^+, \\ldots, X_{n_0}^+\\} \\cup \\{X_1^-, \\ldots, X_{m_0}^-\\}$ will be used for deriving a scoring function on X and\nthe second data set $D'_{n_1,m_1} = \\{X_{n_0+1}^+, \\ldots, X_{n_0+n_1}^+\\} \\cup \\{X_{m_0+1}^-, \\ldots, X_{m_0+m_1}^-\\}$ will serve to\ncompute a pseudo two-sample Wilcoxon test statistic from the ranked data. We set N0 = n0 + m0\nand N1 = n1 + m1 and suppose that ni/Ni \u2192 p as ni and mi tend to infinity for i \u2208 {0, 1}.\nLet \u03b1 \u2208 (0, 1). The testing procedure at level \u03b1 is then performed in two steps, as follows.\n\nSCORE-BASED RANK-SUM WILCOXON TEST\n\n1. Ranking. From dataset $D_{n_0,m_0}$, perform empirical AUC maximization over S0 \u2282 S, yielding\nthe scoring function $\\hat{s}(x) = \\hat{s}_{n_0,m_0}(x)$. Compute the ranks of data with positive labels among\nthe sample $D'_{n_1,m_1}$, once sorted by increasing order of magnitude of their score:\n\n$$\\widehat{R}_i = N_1 \\widehat{S}(X_{n_0+i}^+) \\quad \\text{for } 1 \\leq i \\leq n_1,$$\n\nwhere $\\widehat{F}_{\\hat{s}}(t) = N_1^{-1} \\left( \\sum_{i=1}^{n_1} I\\{\\hat{s}(X_{n_0+i}^+) \\leq t\\} + \\sum_{j=1}^{m_1} I\\{\\hat{s}(X_{m_0+j}^-) \\leq t\\} \\right)$ and $\\widehat{S} = \\widehat{F}_{\\hat{s}} \\circ \\hat{s}$.\n\n2. Rank-sum Wilcoxon test. Reject the homogeneity hypothesis H0 when:\n\n$$\\widehat{W}_{n_1,m_1} \\geq Q_{n_1,m_1}(\\alpha),$$\n\nwhere $\\widehat{W}_{n_1,m_1} = \\sum_{i=1}^{n_1} \\widehat{R}_i$ and $Q_{n_1,m_1}(\\alpha)$ denotes the $(1-\\alpha)$-quantile of distribution $W_{n_1,m_1}$.\n\nThe next result shows that the learning step does not affect the consistency property, provided it\noutputs a universally consistent scoring rule.\nTheorem 2 Let \u03b1 \u2208 (0, 1/2) and suppose that the ranking/scoring method involved at step 1 yields\na universally consistent scoring rule $\\hat{s}$ in the AUC sense. The score-based rank-sum Wilcoxon test\n\n$$\\Phi = I\\left\\{ \\widehat{W}_{n_1,m_1} \\geq Q_{n_1,m_1}(\\alpha) \\right\\}$$\n\nis universally consistent as ni and mi tend to \u221e for i \u2208 {0, 1} at level \u03b1, in the following sense.\n\n1. It is of level \u03b1 for all ni and mi, i \u2208 {0, 1}: P(H,H) {\u03a6 = +1} \u2264 \u03b1 for any H(dx).\n2. Its power converges to 1 as ni and mi, i \u2208 {0, 1}, tend to infinity for every alternative:\n\nlimni, mi\u2192\u221e P(G,H) {\u03a6 = +1} = 1 for every pair of distinct distributions (G, H).\n\nRemark 2 (CONVERGENCE RATES.) Under adequate complexity assumptions on the set S0 over\nwhich empirical AUC maximization or one of its variants is performed, distribution-free rate bounds\nfor the generalization ability of scoring rules may be established in terms of AUC, see Corollary 6 in\n[2] or Corollary 3 in [6]. As shown by a careful examination of the proof of Theorem 2, this makes it possible to derive a\nconvergence rate for the decay of the type II error of the score-based MWW test under any given alternative\n(G, H), when combined with the Berry-Esseen theorem for two-sample U-statistics. For instance,\nif a typical $1/\\sqrt{N_0}$ rate bound holds for $\\hat{s}(x)$, one may show that choosing N1 \u223c N0 then yields a\nrate of order $O_{P_{(G,H)}}(1/\\sqrt{N_0})$.\nRemark 3 (INFINITE-DIMENSIONAL FEATURE SPACE.) We point out that the method presented\nhere is by no means restricted to the case where X is of finite dimension, but may be applied to\nfunctional input data, provided an AUC-consistent ranking procedure can be applied in this context.\n\n5 Numerical examples\n\nThe procedure proposed above is extremely simple once the delicate AUC maximization stage is\nperformed. A striking property is the fact that critical thresholds are set automatically, with no reference\nto the data. We first consider a low-dimensional toy experiment and display some numerical\nresults. Two independent i.i.d. samples of equal size m = n = N/2 have been generated from\ntwo conditional 4-dimensional Gaussian distributions on the hypercube $[-2, 2]^4$. Their parameters\nare denoted by \u00b5+ and \u00b5\u2212 for the means and \u0393 is their common covariance matrix. Three cases\nhave been considered. The first example corresponds to a homogeneous situation: \u00b5+ = \u00b5\u2212 = \u00b51\nwhere \u00b51 = (\u22120.96,\u22120.83, 0.29,\u22121.34) and the upper diagonals of \u03931 are (6.52, 3.84, 4.72, 3.1),\n(\u22121.89, 3.56, 1.52), (\u22123.2, 0.2) and (\u22122.6). In the second example, we test homogeneity under an\nalternative, \u201cfairly far\u201d from H0, where \u00b5\u2212 = \u00b51, \u00b5+ = (0.17,\u22120.24, 0.04,\u22121.02) and \u0393 as before.\nFinally, the third example corresponds to a much more difficult problem, \u201cclose\u201d to H0, where\n\u00b5\u2212 = (1.19,\u22121.20,\u22120.02,\u22120.16), \u00b5+ = (1.08,\u22121.18,\u22120.1,\u22120.06) and the upper diagonals of\n\u0393 are (1.83, 6.02, 0.69, 4.99), (\u22120.65,\u22120.31, 1.03), (\u22120.54,\u22120.03) and (\u22121.24). 
The difficulty of\neach of these examples is illustrated by Fig. 1 in terms of (optimal) ROC curve. The table in Fig. 1\ngives Monte-Carlo estimates of the power of three testing procedures when \u03b1 = 0.05 (averaged\nover B = 150 replications): 1) the score-based MWW test, where ranking is performed using the\nscoring function output by a run of the TREERANK algorithm [9] on a training sample Dn0,m0, 2)\nthe LDA-based Student test sketched in Remark 1 and 3) a bootstrap version of the MMD-test with\na Gaussian RBF Kernel proposed in [1].\n\nDataSet | Sample size (m0,m1) | LDA-Student | Score-based MWW | MMD\nEx. 1 | (500,500) | 6% | 5% | 1%\nEx. 2 | (500,500) | 99% | 99% | 99%\nEx. 3 | (2000,1000) | 75% | 30% | 45%\nEx. 3 | (3000,2000) | 98% | 65% | 73%\n\nFigure 1: Powers and ROC curves describing the \u201cdistance\u201d to H0 for each situation: example 1\n(red), example 2 (black) and example 3 (blue).\n\nIn the second series of experimental results, Gaussian distributions with the same covariance matrix\non Rd are generated, with larger values for the input space dimension d \u2208 {10, 30}. We\nconsidered several problems of given difficulty. The increasing difficulty of the testing problems\nconsidered is controlled through the Euclidean distance between the means \u2206\u00b5 = ||\u00b5+ \u2212 \u00b5\u2212|| and\nis described by Fig. 2, which depicts the related ROC curves, corresponding to situations where\n\u2206\u00b5 \u2208 {0.2, 0.1, 0.08, 0.05}. On these examples, we compared the performance of four methods at\nlevel \u03b1 = 0.05: the score-based MWW test, where ranking is again performed using the scoring\nfunction output by a run of the TREERANK algorithm on a training sample Dn0,m0, the KFDA\ntest proposed in [23], a bootstrap version of the MMD-test with a Gaussian RBF Kernel (MMD)\nand another version, with moment matching to Pearson curves (MMDmom), also using a\nGaussian RBF kernel (see [1]). 
Monte-Carlo estimates of the corresponding powers are given in the\nTable displayed in Fig. 2.\n\n6 Conclusion\n\nWe have provided a sound strategy, involving a preliminary bipartite ranking stage, to extend classical\napproaches for testing homogeneity based on ranks to a multidimensional setup. Consistency\nof the extended version of the popular MWW test has been established, under the assumption of\nuniversal consistency of the ranking method in the AUC sense. This principle can be applied to\nother R-statistics, standing as natural criteria for the bipartite ranking problem [8]. Beyond the illustrative\npreliminary simulation example displayed in this paper, we intend to investigate the relative\nefficiency of such tests with respect to other tests standing as natural candidates in this setup.\n\nAppendix - Proof of Theorem 2\n\nObserve that, conditioned upon the first sample $D_{n_0,m_0}$, the statistic $\\widehat{W}_{n_1,m_1}$ is distributed according\nto $W_{n_1,m_1}$ under the null hypothesis. For any distribution H, we thus have: \u2200\u03b1 \u2208 (0, 1/2),\n\n$$P_{(H,H)}\\left\\{ \\widehat{W}_{n_1,m_1} > Q_{n_1,m_1}(\\alpha) \\mid D_{n_0,m_0} \\right\\} \\leq \\alpha.$$\n\nTaking the expectation, we obtain that the test is of level \u03b1 for all n, m.\n\nCase | Dim. d | MMDboot | MMDmom | KFDA | Sc.-based MWW\ncase 1: \u2206\u00b5 = 0.2 | d = 10 | 86% | 64% | 86% | 90%\ncase 1: \u2206\u00b5 = 0.2 | d = 30 | 58% | 36% | 54% | 85%\ncase 2: \u2206\u00b5 = 0.1 | d = 10 | 20% | 20% | 20% | 58%\ncase 2: \u2206\u00b5 = 0.1 | d = 30 | 7% | 15% | 9% | 47%\ncase 3: \u2206\u00b5 = 0.08 | d = 10 | 19% | 16% | 19% | 42%\ncase 3: \u2206\u00b5 = 0.08 | d = 30 | 7% | 9% | 5% | 32%\ncase 4: \u2206\u00b5 = 0.05 | d = 10 | 13% | 13% | 11% | 18%\ncase 4: \u2206\u00b5 = 0.05 | d = 30 | 6% | 8% | 6% | 16%\n\nFigure 2: Power estimates and ROC curves describing the \u201cdistance\u201d to H0 for each situation: case\n1 (black), case 2 (blue), case 3 (green) and case 4 (red).\n\nFor any s \u2208 S, denote by $U_{n_1,m_1}(s)$ the empirical AUC of s evaluated on the sample $D'_{n_1,m_1}$.\nRecall first that it follows from the two-sample U-statistic theorem (see [20]) that:\n\n$$\\sqrt{N_1}\\,\\{U_{n_1,m_1}(s) - \\mathrm{AUC}(s)\\} = \\frac{\\sqrt{N_1}}{n_1} \\sum_{i=1}^{n_1} \\left\\{ H_s(s(X_{i+n_0}^+)) - E[H_s(s(X_1^+))] \\right\\} - \\frac{\\sqrt{N_1}}{m_1} \\sum_{j=1}^{m_1} \\left\\{ G_s(s(X_{j+m_0}^-)) - E[G_s(s(X_1^-))] \\right\\} + o_{P_{(G,H)}}(1),$$\n\nas n, m tend to infinity. In particular, for any pair of distributions (G, H), the centered random\nvariable $\\sqrt{N_1}\\,\\{U_{n_1,m_1}(s) - \\mathrm{AUC}(s)\\}$ is asymptotically normal with limit variance\n$\\sigma_s^2(G, H) = \\mathrm{Var}(H_s(s(X_1^+)))/p + \\mathrm{Var}(G_s(s(X_1^-)))/(1-p)$ under P(G,H). Notice that $\\sigma_s^2(H, H) = 1/(12p(1-p))$\nfor any s \u2208 S such that the distribution Hs(dt) is continuous. Refer to Theorem 12.4 in [21] for\nfurther details.\nWe now place ourselves under an alternative hypothesis described by a pair of distinct distributions\n(G, H), so that AUC\u2217 > 1/2. Setting $\\widehat{U}_{n_1,m_1} = U_{n_1,m_1}(\\hat{s})$ and decomposing $\\mathrm{AUC}^* - \\widehat{U}_{n_1,m_1}$ as\nthe sum of the deficit of AUC of $\\hat{s}(x)$, namely $\\mathrm{AUC}^* - \\mathrm{AUC}(\\hat{s})$, and the deviation $\\mathrm{AUC}(\\hat{s}) - \\widehat{U}_{n_1,m_1}$\nevaluated on the sample $D'_{n_1,m_1}$, the type II error of \u03a6, given by $P_{(G,H)}\\{\\widehat{W}_{n_1,m_1} \\leq Q_{n_1,m_1}(\\alpha)\\}$, may\nbe bounded by:\n\n$$P_{(G,H)}\\left\\{ \\sqrt{N_1} \\left( \\widehat{U}_{n_1,m_1} - \\mathrm{AUC}(\\hat{s}) \\right) \\leq \\epsilon_{n_1,m_1}(\\alpha) \\right\\} + P_{(G,H)}\\left\\{ \\sqrt{N_1} \\left( \\mathrm{AUC}(\\hat{s}) - \\mathrm{AUC}^* \\right) \\leq \\epsilon_{n_1,m_1}(\\alpha) \\right\\},$$\n\nwhere\n\n$$\\epsilon_{n_1,m_1}(\\alpha) = \\sqrt{N_1} \\left( \\frac{Q_{n_1,m_1}(\\alpha)}{n_1 m_1} - \\frac{n_1 + 1}{2 m_1} - \\frac{1}{2} \\right) - \\sqrt{N_1} \\left( \\mathrm{AUC}^* - \\frac{1}{2} \\right).$$\n\nObserve that, by virtue of the CLT recalled above, $\\sqrt{N_1}\\,(Q_{n_1,m_1}(\\alpha)/(n_1 m_1) - (n_1 + 1)/(2 m_1) - 1/2)$\nconverges to $z_\\alpha/\\sqrt{12p(1-p)}$, where $z_\\alpha$ denotes the $(1-\\alpha)$-quantile of the standard normal distribution. Now, the fact that the type II error of \u03a6 converges to zero as ni and\nmi tend to \u221e for i \u2208 {0, 1} immediately follows from the assumed universal consistency of\n$\\hat{s}(x)$ in the AUC sense and the CLT for two-sample U-statistics combined with the theorem of\ndominated convergence. Due to space limitations, details are omitted.\n\nReferences\n[1] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.\n[2] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. J. Mach. Learn. Res., 6:393\u2013425, 2005.\n[3] G. Biau and L. Gy\u00f6rfi. On the asymptotic properties of a nonparametric l1-test statistic of homogeneity. IEEE Transactions on Information Theory, 51(11):3965\u20133973, 2005.\n[4] Y.K. Cheung and J.H. Klotz. The Mann Whitney Wilcoxon distribution using linked list. Statistica Sinica, 7:805\u2013813, 1997.\n[5] S. Cl\u00e9men\u00e7on, G. Lugosi, and N. Vayatis. Ranking and scoring using empirical risk minimization. In P. Auer and R. 
Meir, editors, Proceedings of COLT 2005, volume 3559 of Lecture Notes in Computer Science, pages 1\u201315. Springer, 2005.\n[6] S. Cl\u00e9men\u00e7on, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36(2):844\u2013874, 2008.\n[7] S. Cl\u00e9men\u00e7on and N. Vayatis. Ranking the best instances. Journal of Machine Learning Research, 8:2671\u20132699, 2007.\n[8] S. Cl\u00e9men\u00e7on and N. Vayatis. Empirical performance maximization based on linear rank statistics. In Advances in Neural Information Processing Systems, volume 3559 of Lecture Notes in Computer Science, pages 1\u201315. Springer, 2009.\n[9] S. Cl\u00e9men\u00e7on and N. Vayatis. Tree-based ranking methods. IEEE Transactions on Information Theory, 55(9):4316\u20134336, 2009.\n[10] W.W. Cohen, R.E. Schapire, and Y. Singer. Learning to order things. In NIPS \u201997: Proceedings of the 1997 conference on Advances in neural information processing systems 10, pages 451\u2013457, Cambridge, MA, USA, 1998. MIT Press.\n[11] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In S. Thrun, L. Saul, and B. Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n[12] C. Ferri, P.A. Flach, and J. Hern\u00e1ndez-Orallo. Learning decision trees using the area under the ROC curve. In ICML \u201902: Proceedings of the Nineteenth International Conference on Machine Learning, pages 139\u2013146, 2002.\n[13] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933\u2013969, 2003.\n[14] S. Kotz and S. Nadarajah. Multivariate t Distributions and Their Applications. Cambridge University Press, 2004.\n[15] E.L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer, 2005.\n[16] H.B. Mann and D.R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat., 18:50\u201360, 1947.\n[17] A. Rachev. Probability Metrics and the Stability of Stochastic Models. Wiley, 1991.\n[18] A. Rakotomamonjy. Optimizing Area Under ROC Curve with SVMs. In Proceedings of the First Workshop on ROC Analysis in AI, 2004.\n[19] C. Rudin, C. Cortes, M. Mohri, and R. E. Schapire. Margin-based ranking and boosting meet in the middle. In P. Auer and R. Meir, editors, Proceedings of COLT 2005, volume 3559 of Lecture Notes in Computer Science, pages 63\u201378. Springer, 2005.\n[20] R.J. Serfling. Approximation theorems of mathematical statistics. Wiley, 1980.\n[21] A.K. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.\n[22] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80\u201383, 1945.\n[23] Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.\n", "award": [], "sourceid": 340, "authors": [{"given_name": "Nicolas", "family_name": "Vayatis", "institution": null}, {"given_name": "Marine", "family_name": "Depecker", "institution": null}, {"given_name": "St\u00e9phan", "family_name": "Cl\u00e9men\u00e7con", "institution": null}]}