{"title": "Online F-Measure Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 595, "page_last": 603, "abstract": "The F-measure is an important and commonly used performance metric for binary prediction tasks. By combining precision and recall into a single score, it avoids disadvantages of simple metrics like the error rate, especially in cases of imbalanced class distributions. The problem of optimizing the F-measure, that is, of developing learning algorithms that perform optimally in the sense of this measure, has recently been tackled by several authors. In this paper, we study the problem of F-measure maximization in the setting of online learning. We propose an efficient online algorithm and provide a formal analysis of its convergence properties. Moreover, first experimental results are presented, showing that our method performs well in practice.", "full_text": "Online F-Measure Optimization\n\nR\u00b4obert Busa-Fekete\n\nDepartment of Computer Science\nUniversity of Paderborn, Germany\n\nbusarobi@upb.de\n\nBal\u00b4azs Sz\u00a8or\u00b4enyi\n\nTechnion, Haifa, Israel /\n\nMTA-SZTE Research Group on\nArti\ufb01cial Intelligence, Hungary\n\nKrzysztof Dembczy\u00b4nski\n\nInstitute of Computing Science\n\nPozna\u00b4n University of Technology, Poland\nkdembczynski@cs.put.poznan.pl\n\nszorenyibalazs@gmail.com\n\nEyke H\u00a8ullermeier\n\nDepartment of Computer Science\nUniversity of Paderborn, Germany\n\neyke@upb.de\n\nAbstract\n\nThe F-measure is an important and commonly used performance metric for bi-\nnary prediction tasks. By combining precision and recall into a single score, it\navoids disadvantages of simple metrics like the error rate, especially in cases of\nimbalanced class distributions. The problem of optimizing the F-measure, that\nis, of developing learning algorithms that perform optimally in the sense of this\nmeasure, has recently been tackled by several authors. 
In this paper, we study\nthe problem of F-measure maximization in the setting of online learning. We\npropose an ef\ufb01cient online algorithm and provide a formal analysis of its conver-\ngence properties. Moreover, \ufb01rst experimental results are presented, showing that\nour method performs well in practice.\n\n1\n\nIntroduction\n\n=\n\nBeing rooted in information retrieval [16], the so-called F-measure is nowadays routinely used as a\n\nbinary labels y = (y1, . . . , yt), the F-measure is de\ufb01ned as\n\nperformance metric in various prediction tasks. Given predictions\ufffdy = (\ufffdy1, . . . ,\ufffdyt) \u2208 {0, 1}t of t\n\n(1)\n\nF (y,\ufffdy) =\n\n2\ufffdt\n2 \u00b7 precision(y,\ufffdy) \u00b7 recall(y,\ufffdy)\ni=1 yi\ufffdyi\nprecision(y,\ufffdy) + recall(y,\ufffdy) \u2208 [0, 1] ,\n\ufffdt\ni=1 yi +\ufffdt\ni=1\ufffdyi\nwhere precision(y,\ufffdy) = \ufffdt\ni=1 yi\ufffdyi/\ufffdt\ni=1\ufffdyi, recall(y,\ufffdy) = \ufffdt\ni=1 yi\ufffdyi/\ufffdt\n\ni=1 yi, and where\n0/0 = 1 by de\ufb01nition. Compared to measures like the error rate in binary classi\ufb01cation, maximizing\nthe F-measure enforces a better balance between performance on the minority and majority class;\ntherefore, it is more suitable in the case of imbalanced data. Optimizing for such an imbalanced\nmeasure is very important in many real-world applications where positive labels are signi\ufb01cantly\nless frequent than negative ones. It can also be generalized to a weighted harmonic average of pre-\ncision and recall. Yet, for the sake of simplicity, we stick to the unweighted mean, which is often\nreferred to as the F1-score or the F1-measure.\nGiven the importance and usefulness of the F-measure [19, 20, 18, 10, 11], it is natural to look for\nlearning algorithms that perform optimally in the sense of this measure. However, optimizing the\nF-measure is a quite challenging problem, especially because the measure is not decomposable over\nthe binary predictions. 
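Definition (1), together with the convention $0/0 = 1$, can be computed directly from the label and prediction sequences; a minimal sketch in Python (function names are ours):

```python
def f_measure(y, y_hat):
    """F-measure (1): 2*sum(y_i*yhat_i) / (sum(y_i) + sum(yhat_i)), with 0/0 := 1."""
    num = 2 * sum(yi * yh for yi, yh in zip(y, y_hat))
    den = sum(y) + sum(y_hat)
    return 1.0 if den == 0 else num / den

def precision(y, y_hat):
    tp = sum(yi * yh for yi, yh in zip(y, y_hat))
    return 1.0 if sum(y_hat) == 0 else tp / sum(y_hat)

def recall(y, y_hat):
    tp = sum(yi * yh for yi, yh in zip(y, y_hat))
    return 1.0 if sum(y) == 0 else tp / sum(y)
```

Note that, unlike the error rate, this score is not an average of per-round terms: the F-measure of a concatenated sequence is in general not the average of the F-measures of its parts, which is exactly the non-decomposability discussed above.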
This problem has received increasing attention in recent years and has been tackled by several authors [18, 9, 10, 11]. However, most of this work has been done in the standard setting of batch learning.

In this paper, we study the problem of F-measure optimization in the setting of online learning [4, 2], which is becoming increasingly popular in machine learning. In fact, there are many applications in which training data arrives progressively over time, and models need to be updated and maintained incrementally. In our setting, this means that in each round $t$ the learner first outputs a prediction $\hat{y}_t$ and then observes the true label $y_t$. Formally, the protocol in round $t$ is as follows:

1. first an instance $x_t \in \mathcal{X}$ is observed by the learner,
2. then the predicted label $\hat{y}_t$ for $x_t$ is computed on the basis of the first $t$ instances $(x_1, \ldots, x_t)$, the $t-1$ labels $(y_1, \ldots, y_{t-1})$ observed so far, and the corresponding predictions $(\hat{y}_1, \ldots, \hat{y}_{t-1})$,
3. finally, the label $y_t$ is revealed to the learner.

The goal of the learner is then to maximize

$$F^{(t)} = F\big((y_1, \ldots, y_t), (\hat{y}_1, \ldots, \hat{y}_t)\big) \qquad (2)$$

over time. Optimizing the F-measure in an online fashion is challenging mainly because of the non-decomposability of the measure, and the fact that $\hat{y}_t$ cannot be changed after round $t$.

As a potential application of online F-measure optimization, consider the recommendation of news from RSS feeds or tweets [1]. Besides, it is worth mentioning that online methods are also relevant in the context of big data and large-scale learning, where the volume of data, despite being finite, prevents one from processing each data point more than once [21, 7]. Treating the data as a stream, online algorithms can then be used as single-pass algorithms.
Note, however, that single-pass algorithms are evaluated only at the end of the training process, unlike online algorithms, which are supposed to learn and predict simultaneously.

We propose an online algorithm for F-measure optimization which is not only very efficient but also easy to implement. Unlike other methods, our algorithm does not require extra validation data for tuning a threshold (that separates between positive and negative predictions), and therefore allows the entire data to be used for training. We provide a formal analysis of the convergence properties of our algorithm and prove its statistical consistency under different assumptions on the learning process. Moreover, first experimental results are presented, showing that our method performs well in practice.

2 Formal Setting

In this paper, we consider a stochastic setting in which $(x_1, y_1), \ldots, (x_t, y_t)$ are assumed to be i.i.d. samples from some unknown distribution $\rho(\cdot)$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} = \{0,1\}$ is the label space and $\mathcal{X}$ is some instance space. We denote the marginal distribution of the feature vector $X$ by $\mu(\cdot)$.¹ Then, the posterior probability of the positive class, i.e., the conditional probability that $Y = 1$ given $X = x$, is $\eta(x) = \mathbf{P}(Y = 1 \mid X = x) = \frac{\rho(x,1)}{\rho(x,0) + \rho(x,1)}$. The prior distribution of class 1 can be written as $\pi_1 = \mathbf{P}(Y = 1) = \int_{x \in \mathcal{X}} \eta(x)\, d\mu(x)$.

¹ $\mathcal{X}$ is assumed to exhibit the required measurability properties.

Let $\mathcal{B} = \{f : \mathcal{X} \to \{0,1\}\}$ be the set of all binary classifiers over the set $\mathcal{X}$. The F-measure of a binary classifier $f \in \mathcal{B}$ is calculated as

$$F(f) = \frac{2\int_{\mathcal{X}} \eta(x) f(x)\, d\mu(x)}{\int_{\mathcal{X}} \eta(x)\, d\mu(x) + \int_{\mathcal{X}} f(x)\, d\mu(x)} = \frac{2\,\mathbf{E}\left[\eta(X) f(X)\right]}{\mathbf{E}\left[\eta(X)\right] + \mathbf{E}\left[f(X)\right]} \; .$$

According to [19], the expected value of (1) converges to $F(f)$ with $t \to \infty$ when $f$ is used to calculate $\hat{y}$, i.e., $\hat{y}_t = f(x_t)$. Thus, $\lim_{t\to\infty} \mathbf{E}\big[F\big((y_1,\ldots,y_t), (f(x_1),\ldots,f(x_t))\big)\big] = F(f)$.

Now, let $\mathcal{G} = \{g : \mathcal{X} \to [0,1]\}$ denote the set of all probabilistic binary classifiers over the set $\mathcal{X}$, and let $\mathcal{T} \subseteq \mathcal{B}$ denote the set of binary classifiers that are obtained by thresholding a classifier $g \in \mathcal{G}$, that is, classifiers of the form

$$g^\tau(x) = \llbracket g(x) \ge \tau \rrbracket \qquad (3)$$

for some threshold $\tau \in [0,1]$, where $\llbracket \cdot \rrbracket$ is the indicator function that evaluates to 1 if its argument is true and to 0 otherwise.

According to [19], the optimal F-score, computed as $\max_{f \in \mathcal{B}} F(f)$, can be achieved by a thresholded classifier. More precisely, let us define the thresholded F-measure as

$$F(\tau) = F(\eta^\tau) = \frac{2\int_{\mathcal{X}} \eta(x)\,\llbracket \eta(x) \ge \tau \rrbracket\, d\mu(x)}{\int_{\mathcal{X}} \eta(x)\, d\mu(x) + \int_{\mathcal{X}} \llbracket \eta(x) \ge \tau \rrbracket\, d\mu(x)} = \frac{2\,\mathbf{E}\left[\eta(X)\, \llbracket \eta(X) \ge \tau \rrbracket\right]}{\mathbf{E}\left[\eta(X)\right] + \mathbf{E}\left[\llbracket \eta(X) \ge \tau \rrbracket\right]} \; . \qquad (4)$$

Then the optimal threshold $\tau^*$ can be obtained as

$$\tau^* = \operatorname*{argmax}_{0 \le \tau \le 1} F(\tau) \; . \qquad (5)$$

Clearly, for the classifier in the form of (3) with $g(x) = \eta(x)$ and $\tau = \tau^*$, we have $F(g^\tau) = F(\tau^*)$. Then, as shown by [19] (see their Theorem 4), the performance of any binary classifier $f \in \mathcal{B}$ cannot exceed $F(\tau^*)$, i.e., $F(f) \le F(\tau^*)$ for all $f \in \mathcal{B}$. Therefore, estimating posteriors first and adjusting a threshold afterward appears to be a reasonable strategy. In practice, this seems to be the most popular way of maximizing the F-measure in a batch mode; we call it the 2-stage F-measure maximization approach, or 2S for short.
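The threshold-adjustment stage of 2S reduces to a one-dimensional search: the empirical F-score changes only at the observed posterior estimates, so it suffices to evaluate the candidate thresholds $\{\hat{p}_1, \ldots, \hat{p}_N\}$, which takes $O(N \log N)$ time after sorting. A minimal sketch (function name is ours), where a threshold $\tau$ predicts positive iff $\tau \le \hat{p}_i$:

```python
def tune_threshold(p_hat, y):
    """Return (tau, F) maximizing the empirical F-score over candidate
    thresholds tau in {p_hat_i}, where tau predicts positive iff tau <= p_hat_i."""
    pairs = sorted(zip(p_hat, y), reverse=True)  # scores in decreasing order
    n_pos = sum(y)
    tp = pred = 0
    best_tau, best_f = 1.0, -1.0
    i, n = 0, len(pairs)
    while i < n:
        tau = pairs[i][0]
        # all instances with score >= tau are predicted positive at threshold tau
        while i < n and pairs[i][0] == tau:
            pred += 1
            tp += pairs[i][1]
            i += 1
        den = n_pos + pred
        f = 1.0 if den == 0 else 2 * tp / den
        if f > best_f:
            best_tau, best_f = tau, f
    return best_tau, best_f
```

The sweep maintains the true-positive and predicted-positive counts incrementally, so each candidate is evaluated in constant time after the initial sort.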
More specifically, the 2S approach consists of two steps: first, a classifier is trained for estimating the posteriors, and second, a threshold is tuned on the posterior estimates. For the time being, we are not interested in the training of this classifier but focus on the second step, that is, the labeling of instances via thresholding posterior probabilities. For doing this, suppose a finite set $D_N = \{(x_i, y_i)\}_{i=1}^N$ of labeled instances is given as training information. Moreover, suppose estimates $\hat{p}_i = g(x_i)$ of the posterior probabilities $p_i = \eta(x_i)$ are provided by a classifier $g \in \mathcal{G}$. Next, one might define the F-score obtained by applying the threshold classifier $g^\tau$ on the data $D_N$ as follows:

$$F(\tau; g, D_N) = \frac{2\sum_{i=1}^N y_i \llbracket \tau \le g(x_i) \rrbracket}{\sum_{i=1}^N y_i + \sum_{i=1}^N \llbracket \tau \le g(x_i) \rrbracket} \qquad (6)$$

In order to find an optimal threshold $\tau_N \in \operatorname{argmax}_{0 \le \tau \le 1} F(\tau; g, D_N)$, it suffices to search the finite set $\{\hat{p}_1, \ldots, \hat{p}_N\}$, which requires time $O(N \log N)$. In [19], it is shown that $F(\tau; g, D_N) \xrightarrow{P} F(g^\tau)$ as $N \to \infty$ for any $\tau \in (0,1)$, and [11] provides an even stronger result: if a classifier $g_{D_N}$ is induced from $D_N$ by an L1-consistent learner,² and a threshold $\tau_N$ is obtained by maximizing (6) on an independent set $D'_N$, then $F(g_{D_N}^{\tau_N}) \xrightarrow{P} F(\tau^*)$ as $N \to \infty$ (under mild assumptions on the data distribution).

3 Maximizing the F-Measure on a Population Level

In this section we assume that the data distribution is known. According to the analysis in the previous section, optimizing the F-measure boils down to finding the optimal threshold $\tau^*$. At this point, an observation is in order.

Remark 1. In general, the function $F(\tau)$ is neither convex nor concave.
For example, when $\mathcal{X}$ is finite, the numerator and denominator of (4) are step functions, whence so is $F(\tau)$. Therefore, gradient methods cannot be applied for finding $\tau^*$.

Nevertheless, $\tau^*$ can be found based on a recent result of [20], who show that finding the root of

$$h(\tau) = \int_{x \in \mathcal{X}} \max\left(0, \eta(x) - \tau\right) d\mu(x) - \tau \pi_1 \qquad (7)$$

is a necessary and sufficient condition for optimality. Note that $h(\tau)$ is continuous and strictly decreasing, with $h(0) = \pi_1$ and $h(1) = -\pi_1$. Therefore, $h(\tau) = 0$ has a unique solution, which is $\tau^*$. Moreover, [20] also prove an interesting relationship between the optimal threshold and the F-measure induced by that threshold: $F(\tau^*) = 2\tau^*$.

The marginal distribution of the feature vectors, $\mu(\cdot)$, induces a distribution $\zeta(\cdot)$ on the posteriors: $\zeta(p) = \int_{x \in \mathcal{X}} \llbracket \eta(x) = p \rrbracket\, d\mu(x)$ for all $p \in [0,1]$; by definition, $\zeta(p)$ is the density of observing an instance $x$ for which the probability of the positive label is $p$. We shall write concisely $d\nu(p) = \zeta(p)\, dp$. Since $\nu(\cdot)$ is an induced probability measure, the measurable transformation allows us to rewrite the notions introduced above in terms of $\nu(\cdot)$ instead of $\mu(\cdot)$; see, for example, Section 1.4 in [17].

² A learning algorithm, viewed as a map from samples $D_N$ to classifiers $g_{D_N}$, is called L1-consistent w.r.t. the data distribution $\rho$ if $\lim_{N\to\infty} \mathbf{P}_{D_N \sim \rho}\left( \int_{x \in \mathcal{X}} |g_{D_N}(x) - \eta(x)|\, d\mu(x) > \epsilon \right) = 0$ for all $\epsilon > 0$.
For example, the prior probability $\int_{\mathcal{X}} \eta(x)\, d\mu$ can be written equivalently as $\int_0^1 p\, d\nu(p)$. Likewise, (7) can be rewritten as follows:

$$h(\tau) = \int_0^1 \max(0, p - \tau)\, d\nu(p) - \tau \int_0^1 p\, d\nu(p) = \int_\tau^1 (p - \tau)\, d\nu(p) - \tau \int_0^1 p\, d\nu(p) = \int_\tau^1 p\, d\nu(p) - \tau \left( \int_\tau^1 1\, d\nu(p) + \int_0^1 p\, d\nu(p) \right) \qquad (8)$$

Equation (8) will play a central role in our analysis. Note that precise knowledge of $\nu(\cdot)$ suffices to find the maxima of $F(\tau)$. This is illustrated by two examples presented in Appendix E, in which we assume specific distributions for $\nu(\cdot)$, namely uniform and Beta distributions.

4 Algorithmic Solution

In this section, we provide an algorithmic solution to the online F-measure maximization problem. For this, we shall need in each round $t$ some classifier $g_t \in \mathcal{G}$ that provides us with some estimate $\hat{p}_t = g_t(x_t)$ of the probability $\eta(x_t)$. We would like to stress again that the focus of our analysis is on optimal thresholding instead of classifier learning. Thus, we assume the sequence of classifiers $g_1, g_2, \ldots$ to be produced by an external online learner, for example, logistic regression trained by stochastic gradient descent.

As an aside, we note that F-measure maximization is not directly comparable with the task that is most often considered and analyzed in online learning, namely regret minimization [4]. This is mainly because the F-measure is a non-decomposable performance metric. In fact, the cumulative regret is a summation of a per-round regret $r_t$, which only depends on the prediction $\hat{y}_t$ and the true outcome $y_t$ [11]. In the case of the F-measure, the score $F^{(t)}$, and therefore the optimal prediction $\hat{y}_t$, depends on the entire history, that is, all observations and decisions made by the learner till time $t$. This is discussed in more detail in Section 6.

The most naive way of forecasting labels is to implement online learning as repeated batch learning, that is, to apply a batch learner (such as 2S) to $D_t = \{(x_i, y_i)\}_{i=1}^t$ in each time step $t$. Obviously, however, this strategy is prohibitively expensive, as it requires storage of all data points seen so far (at least in mini-batches), as well as optimization of the threshold $\tau_t$ and re-computation of the classifier $g_t$ on an ever growing number of examples.

In the following, we propose a more principled technique to maximize the online F-measure. Our approach is based on the observation that $h(\tau^*) = 0$ and $h(\tau)(\tau - \tau^*) < 0$ for any $\tau \in [0,1]$ such that $\tau \ne \tau^*$ [20]. Moreover, $h$ is a monotone decreasing continuous function. Therefore, finding the optimal threshold $\tau^*$ can be viewed as a root finding problem. In practice, however, $h(\tau)$ is not known and can only be estimated. Let us define $h(\tau, y, \hat{y}) = y\hat{y} - \tau(y + \hat{y})$. For now, assume $\eta(x)$ to be known and write concisely $\hat{h}(\tau) = h(\tau, y, \llbracket \eta(x) \ge \tau \rrbracket)$.
We can compute the expectation of $\hat{h}(\tau)$ with respect to the data distribution for a fixed threshold $\tau$ as follows:

$$\mathbf{E}\big[\hat{h}(\tau)\big] = \mathbf{E}\left[y\, \llbracket \eta(x) \ge \tau \rrbracket - \tau\left(y + \llbracket \eta(x) \ge \tau \rrbracket\right)\right] = \int_0^1 p\, \llbracket p \ge \tau \rrbracket\, d\nu(p) - \tau\left(\int_0^1 p\, d\nu(p) + \int_\tau^1 1\, d\nu(p)\right) = h(\tau) \qquad (9)$$

Thus, an unbiased estimate of $h(\tau)$ can be obtained by evaluating $\hat{h}(\tau)$ for an instance $x$. This suggests designing a stochastic approximation algorithm that is able to find the root of $h(\cdot)$ similarly to the Robbins-Monro algorithm [12]. Exploiting the relationship between the optimal F-measure and the optimal threshold, $F(\tau^*) = 2\tau^*$, we define the threshold in time step $t$ as

$$\tau_t = \frac{1}{2} F^{(t)} = \frac{a_t}{b_t} \quad \text{where} \quad a_t = \sum_{i=1}^t y_i \hat{y}_i \,, \quad b_t = \sum_{i=1}^t y_i + \sum_{i=1}^t \hat{y}_i \,. \qquad (10)$$

With this threshold, the first differences between thresholds, i.e. $\tau_{t+1} - \tau_t$, can be written as follows.

Proposition 2. If thresholds $\tau_t$ are defined according to (10) and $\hat{y}_{t+1}$ as $\llbracket \eta(x_{t+1}) > \tau_t \rrbracket$, then

$$(\tau_{t+1} - \tau_t)\, b_{t+1} = h(\tau_t, y_{t+1}, \hat{y}_{t+1}) \,. \qquad (11)$$

The proof of Prop. 2 is deferred to Appendix A. According to (11), the method we obtain "almost" coincides with the update rule of the Robbins-Monro algorithm. There are, however, some notable differences. In particular, the sequence of coefficients, namely the values $1/b_{t+1}$, does not consist of predefined real values converging to zero (as fast as $1/t$).
Instead, it consists of random quantities that depend on the history, namely the observed labels $y_1, \ldots, y_t$ and the predicted labels $\hat{y}_1, \ldots, \hat{y}_t$. Moreover, these "coefficients" are not independent of $h(\tau_t, y_{t+1}, \hat{y}_{t+1})$ either. In spite of these additional difficulties, we shall present a convergence analysis of our algorithm in the next section.

The pseudo-code of our online F-measure optimization algorithm, called Online F-measure Optimizer (OFO), is shown in Algorithm 1.

Algorithm 1 OFO
1: Select $g_0$ from $\mathcal{B}$, and set $\tau_0 = 0$
2: for $t = 1 \to \infty$ do
3:   Observe the instance $x_t$
4:   $\hat{p}_t \leftarrow g_{t-1}(x_t)$            ▷ estimate posterior
5:   $\hat{y}_t \leftarrow \llbracket \hat{p}_t \ge \tau_{t-1} \rrbracket$   ▷ current prediction
6:   Observe label $y_t$
7:   Calculate $F^{(t)} = 2a_t/b_t$ and $\tau_t = a_t/b_t$
8:   $g_t \leftarrow A(g_{t-1}, x_t, y_t)$          ▷ update the classifier
9: return $\tau_T$

The forecast rule can be written in the form of $\hat{y}_t = \llbracket p_t \ge \tau_{t-1} \rrbracket$ for $x_t$, where the threshold is defined in (10) and $p_t = \eta(x_t)$. In practice, we use $\hat{p}_t = g_{t-1}(x_t)$ as an estimate of the true posterior $p_t$. In line 8 of the code, an online learner $A : \mathcal{G} \times \mathcal{X} \times \mathcal{Y} \to \mathcal{G}$ is assumed, which produces classifiers $g_t$ by incrementally updating the current classifier with the newly observed example, i.e., $g_t = A(g_{t-1}, x_t, y_t)$. In our experimental study, we shall test and compare various state-of-the-art online learners as possible choices for $A$.

5 Consistency

In this section, we provide an analysis of the online F-measure optimizer proposed in the previous section.
More specifically, we show the statistical consistency of the OFO algorithm: the sequences of online thresholds and F-scores produced by this algorithm converge, respectively, to the optimal threshold $\tau^*$ and the optimal thresholded F-score $F(\tau^*)$ in probability. As a first step, we prove this result under the assumption of knowledge about the true posterior probabilities; then, in a second step, we consider the case of estimated posteriors.

Theorem 3. Assume the posterior probabilities $p_t = \eta(x_t)$ of the positive class to be known in each step of the online learning process. Then, the sequences of thresholds $\tau_t$ and online F-scores $F^{(t)}$ produced by OFO both converge in probability to their optimal values $\tau^*$ and $F(\tau^*)$, respectively: for any $\epsilon > 0$, we have $\lim_{t\to\infty} \mathbf{P}\left(|\tau_t - \tau^*| > \epsilon\right) = 0$ and $\lim_{t\to\infty} \mathbf{P}\left(|F^{(t)} - F(\tau^*)| > \epsilon\right) = 0$.

Here is a sketch of the proof of this theorem, the details of which can be found in the supplementary material (Appendix B):

• We focus on $\{\tau_t\}_{t=1}^\infty$, which is a stochastic process the filtration of which is defined as $\mathcal{F}_t = \{y_1, \ldots, y_t, \hat{y}_1, \ldots, \hat{y}_t\}$. For this filtration, one can show that $\hat{h}(\tau_t)$ is $\mathcal{F}_t$-measurable and $\mathbf{E}\big[\hat{h}(\tau_t) \,\big|\, \mathcal{F}_t\big] = h(\tau_t)$, based on (9).

• As a first step, we can decompose the update rule given in (11) as $\mathbf{E}\left[\frac{1}{b_{t+1}} \hat{h}(\tau_t) \,\middle|\, \mathcal{F}_t\right] = \frac{1}{b_t + 2}\, h(\tau_t) + O\!\left(\frac{1}{b_t^2}\right)$, conditioned on the filtration $\mathcal{F}_t$ (see Lemma 7).

• Next, we show that the sequence $1/b_t$ behaves similarly to $1/t$, in the sense that $\sum_{t=1}^\infty \mathbf{E}\left[1/b_t^2\right] < \infty$ (see Lemma 8). Moreover, one can show that $\sum_{t=1}^\infty \mathbf{E}\left[1/b_t\right] \ge \sum_{t=1}^\infty \frac{1}{2t} = \infty$.

• Although $h(\tau)$ is not differentiable on $[0,1]$ in general (it can be piecewise linear, for example), one can show that its finite difference is between $-1 - \pi_1$ and $-\pi_1$ (see Proposition 9 in the appendix). As a consequence of this result, our process defined in (11) does not get stuck, even close to $\tau^*$.

• The main part of the proof is devoted to analyzing the properties of the sequence $\beta_t = \mathbf{E}\left[(\tau_t - \tau^*)^2\right]$, for which we show that $\lim_{t\to\infty} \beta_t = 0$, which is sufficient for the statement of the theorem. Our proof follows the convergence analysis of [12]. Nevertheless, our analysis essentially differs from theirs, since in our case the coefficients cannot be chosen freely. Instead, as explained before, they depend on the labels observed and predicted so far. In addition, the noisy estimation of $h(\cdot)$ depends on the labels, too, but the decomposition step allows us to handle this undesired effect.

Remark 4. In principle, the Robbins-Monro algorithm can be applied for finding the root of $h(\cdot)$ as well. This yields an update rule similar to (11), with $1/b_{t+1}$ replaced by $C/t$ for a constant $C > 0$. In this case, however, the convergence of the online F-measure is difficult to analyze (if possible at all), because the empirical process cannot be written in a nice form. Moreover, as found in our analysis, the coefficient $C$ should be set to approximately $1/\pi_1$ (see Proposition 9 and the choice of $\{k_t\}$ at the end of the proof of Theorem 3). Yet, since $\pi_1$ is not known beforehand, it needs to be estimated from the samples, which implies that the coefficients are not independent of the noisy evaluations of $h(\cdot)$, just like in the case of the OFO algorithm.
Interestingly, OFO seems to properly adjust the values $1/b_{t+1}$ in an adaptive manner ($b_t$ is a sum of two terms, the first of which is $t\pi_1$ in expectation), which is a very nice property of the algorithm. Empirically, based on synthetic data, we found the performance of the original Robbins-Monro algorithm to be on par with OFO.

As already announced, we are now going to relax the assumption of known posterior probabilities $p_t = \eta(x_t)$. Instead, estimates $\hat{p}_t = g_t(x_t) \approx p_t$ of these probabilities are obtained by classifiers $g_t$ that are provided by the external online learner in Algorithm 1. More concretely, assume an online learner $A : \mathcal{G} \times \mathcal{X} \times \mathcal{Y} \to \mathcal{G}$, where $\mathcal{G}$ is the set of probabilistic classifiers. Given a current model $g_t$ and a new example $(x_t, y_t)$, this learner produces an updated classifier $g_{t+1} = A(g_t, x_t, y_t)$. Showing a consistency result for this scenario requires some assumptions on the online learner. With this formal definition of online learner, a statistical consistency result similar to Theorem 3 can be shown. The proof of the following theorem is again deferred to the supplementary material (Appendix C).

Theorem 5. Assume that the classifiers $(g_t)_{t=1}^\infty$ in the OFO framework are provided by an online learner for which the following holds: there is a $\lambda > 0$ such that $\mathbf{E}\left[\int_{x \in \mathcal{X}} |\eta(x) - g_t(x)|\, d\mu(x)\right] = O(t^{-\lambda})$. Then $F^{(t)} \xrightarrow{P} F(\tau^*)$ and $\tau_t \xrightarrow{P} \tau^*$.

This theorem's requirement on the online learner is stronger than what is assumed by [11] and recalled in Footnote 2. First, the learner is trained online and not in a batch mode. Second, we also require that the L1 error of the learner goes to 0 with a convergence rate of order $t^{-\lambda}$.

It might be interesting to note that a universal rate of convergence cannot be established without assuming regularity properties of the data distribution, such as smoothness via absolute continuity. Results of that kind are beyond the scope of this study. Instead, we refer the reader to [5, 6] for details on L1 consistency and its connection to the rate of convergence.

6 Discussion

Regret optimization and stochastic approximation: Stochastic approximation algorithms can be applied for finding the optimum of (4) or, equivalently, for finding the unique root of (8) based on noisy evaluations; the latter formulation is better suited for the classic version of the Robbins-Monro root finding algorithm [12]. These algorithms are iterative methods whose analysis focuses on the difference of $F(\tau_t)$ from $F(\tau^*)$, where $\tau_t$ denotes the estimate of $\tau^*$ in iteration $t$, whereas our online setting is concerned with the distance of $F((y_1, \ldots, y_t), (\hat{y}_1, \ldots, \hat{y}_t))$ from $F(\tau^*)$, where $\hat{y}_i$ is the prediction for $y_i$ in round $i$. This difference is crucial: $F(\tau_t)$ only depends on $\tau_t$ and, in addition, if $\tau_t$ is close to $\tau^*$ then $F(\tau_t)$ is also close to $F(\tau^*)$ (see [19] for concentration properties), whereas in the online F-measure optimization setup, $F((y_1, \ldots, y_t), (\hat{y}_1, \ldots, \hat{y}_t))$ can be very different from $F(\tau^*)$, even if the current estimate $\tau_t$ is close to $\tau^*$, in case the number of previous incorrect predictions is large.

In online learning and online optimization it is common to work with the notion of (cumulative) regret. In our case, this notion could be interpreted either as $\sum_{i=1}^t |F((y_1, \ldots, y_i), (\hat{y}_1, \ldots, \hat{y}_i)) - F(\tau^*)|$ or as $\sum_{i=1}^t |y_i - \hat{y}_i|$. After division by $t$, the former becomes the average accuracy of the F-measure over time and the latter the accuracy of our predictions. The former is hard to interpret because $|F((y_1, \ldots, y_i), (\hat{y}_1, \ldots, \hat{y}_i)) - F(\tau^*)|$ itself is an aggregate measure of our performance over the first $t$ rounds, which thus makes no sense to aggregate again. The latter, on the other hand, differs qualitatively from our ultimate goal; in fact, $|F((y_1, \ldots, y_t), (\hat{y}_1, \ldots, \hat{y}_t)) - F(\tau^*)|$ is the alternate measure that we are aiming to optimize for instead of the accuracy.

Table 1: Main statistics of the benchmark datasets and one-pass F-scores obtained by the OFO and 2S methods on various datasets. The bold numbers indicate when the difference between the performance of the OFO and 2S methods is significant. The significance level is set to one sigma, which is estimated based on the repetitions.

Dataset      #instances   #pos       #neg       #features   LogReg (2S/OFO)   Pegasos (2S/OFO)   Perceptron (2S/OFO)
gisette      7000         3500       3500       5000        0.954 / 0.955     0.950 / 0.935      0.935 / 0.920
news20.bin   19996        9997       9999       1355191     0.879 / 0.876     0.879 / 0.883      0.908 / 0.930
Replab       45671        10797      34874      353754      0.924 / 0.923     0.926 / 0.928      0.914 / 0.914
WebspamUni   350000       212189     137811     254         0.912 / 0.918     0.914 / 0.910      0.927 / 0.912
epsilon      500000       249778     250222     2000        0.878 / 0.872     0.884 / 0.886      0.862 / 0.872
covtype      581012       297711     283301     54          0.761 / 0.762     0.754 / 0.760      0.732 / 0.719
url          2396130      792145     1603985    3231961     0.962 / 0.963     0.951 / 0.950      0.971 / 0.972
SUSY         5000000      2287827    2712173    18          0.762 / 0.762     0.754 / 0.745      0.710 / 0.720
kdda         8918054      7614730    1303324    20216830    0.927 / 0.926     0.921 / 0.926      0.913 / 0.927
kddb         20012498     17244034   2768464    29890095    0.934 / 0.934     0.930 / 0.929      0.923 / 0.928

Online optimization of non-decomposable measures: Online optimization of the F-measure can be seen as a special case of optimizing non-decomposable loss functions as recently considered by [9]. Their framework essentially differs from ours in several points. First, regarding the data generation process, an adversarial setup with an oblivious adversary is assumed, unlike our current study, where a stochastic setup is assumed. From this point of view, their assumption is more general, since the oblivious adversary captures the stochastic setup. Second, the set of classifiers is restricted to differentiable parametric functions, which may not include the F-measure maximizer. Therefore, their proof of vanishing regret does in general not imply convergence to the optimal F-score. Seen from this point of view, their result is weaker than our proof of consistency (i.e., convergence to the optimal F-measure in probability if the posterior estimates originate from a consistent learner). Finally, there are some other non-decomposable performance measures which are intensively used in many practical applications, and whose optimization has already been investigated in the online or one-pass setup. The most notable such measure might be the area under the ROC curve (AUC), which has been investigated in an online learning framework by [21, 7].

7 Experiments

In this section, the performance of the OFO algorithm is evaluated in a one-pass learning scenario on benchmark datasets, and compared with the performance of the 2-stage F-measure maximization approach (2S) described in Section 2.
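Before turning to the benchmarks, the threshold dynamics of OFO can be checked on synthetic data with known posteriors (cf. Appendix F). The sketch below (all names are ours) assumes $\nu$ uniform on $[0,1]$: then $\pi_1 = 1/2$ and, by (8), $h(\tau) = (1-\tau)^2/2 - \tau/2$, whose root is $\tau^* = (3 - \sqrt{5})/2 \approx 0.382$, so $F(\tau^*) = 2\tau^* \approx 0.764$:

```python
import random

def ofo_threshold(stream):
    """OFO with known posteriors: y_hat_t = [p_t >= tau_{t-1}], tau_t = a_t / b_t, cf. (10)."""
    a = b = 0
    tau = 0.0  # tau_0 = 0, as in Algorithm 1
    for p, y in stream:
        y_hat = 1 if p >= tau else 0
        a += y * y_hat
        b += y + y_hat
        if b > 0:
            tau = a / b  # online F-score F(t) equals 2 * tau
    return tau

def synthetic_stream(t, rng):
    """Posteriors p ~ Uniform(0, 1) (i.e. nu uniform), labels y ~ Bernoulli(p)."""
    for _ in range(t):
        p = rng.random()
        yield p, 1 if rng.random() < p else 0

rng = random.Random(0)
tau = ofo_threshold(synthetic_stream(200_000, rng))
# tau should approach tau* = (3 - 5 ** 0.5) / 2, roughly 0.382
```

Replacing the implicit step size $1/b_{t+1}$ by $C/t$ with $C \approx 1/\pi_1 = 2$ gives the Robbins-Monro variant of Remark 4; on this distribution both land near $\tau^*$.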
We also assess the rate of convergence of the OFO algorithm in a pure online learning setup.3

The online learner A in OFO was implemented in different ways, using Logistic Regression (LOGREG), the classical Perceptron algorithm (PERCEPTRON) [13], and an online linear SVM called PEGASOS [14]. In the case of LOGREG, we applied the algorithm introduced in [15], which handles L1 and L2 regularization. The hyperparameters of the methods and the validation procedures are described below and in more detail in Appendix D. Where necessary, the raw outputs of the learners were turned into valid probability estimates, i.e., they were rescaled to [0, 1] using a logistic transform.

In the experiments, we used nine datasets taken from the LibSVM repository of binary classification tasks.4 Many of these datasets are commonly used as benchmarks in information retrieval, where the F-score is routinely applied for model selection. In addition, we also used the textual data released in the RepLab challenge of identifying relevant tweets [1]; we generated the features used by the winning team [8]. The main statistics of the datasets are summarized in Table 1.

3 Additional results of experiments conducted on synthetic data are presented in Appendix F.
4 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

[Figure 1: four panels (SUSY, WebspamUni, kdda, url); x-axis: number of samples (log scale); y-axis: online F-score; curves: One-pass and Online variants of LOGREG, PEGASOS, and PERCEPTRON.]

Figure 1: Online F-scores obtained by the OFO algorithm on various datasets. The dashed lines represent the one-pass performance of the OFO algorithm from Table 1, which we considered as a baseline.

One-pass learning. In one-pass learning, the learner is allowed to read the training data only once, whence online learners are commonly used in this setting. We ran OFO along with the three classifiers trained on 80% of the data. The learner obtained by OFO is of the form g_t^{τ_t}, where t is the number of training samples. The remaining 20% of the data was used to evaluate g_t^{τ_t} in terms of the F-measure. We ran every method on 10 randomly shuffled versions of the data and averaged the results. The means of the F-scores computed on the test data are shown in Table 1. As a baseline, we applied the 2S approach. More concretely, we trained the same set of learners on 60% of the data and validated the threshold on 20% by optimizing (6). Since both approaches are consistent, the performance of OFO should be on par with the performance of 2S. This is confirmed by the results, in which significant differences are observed in only 7 of 30 cases. These differences in performance might be explained by the finiteness of the data.
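For concreteness, the plug-in prediction evaluated above thresholds a probability estimate at the current value τ_t; raw scores, e.g. from PERCEPTRON or PEGASOS, are first rescaled to [0, 1] via a logistic transform. A minimal sketch (the function names `to_probability` and `predict` are ours, not from the paper):

```python
import math

def to_probability(raw_score):
    # Logistic transform: rescale a raw margin/score into [0, 1]
    # so that it can be treated as a probability estimate.
    return 1.0 / (1.0 + math.exp(-raw_score))

def predict(raw_score, tau):
    # Plug-in (thresholded) prediction: output 1 iff the probability
    # estimate reaches the current threshold tau (tau_t in the paper).
    return 1 if to_probability(raw_score) >= tau else 0
```

Note that this rescaling is monotone, so it does not calibrate the scores; as discussed below, poorly calibrated estimates can still hurt the resulting F-score.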
The advantage of our approach over 2S is that there is no need for validation and the data needs to be read only once; it can therefore be applied in a pure one-pass learning scenario. The hyperparameters of the learning methods were chosen based on the performance of 2S; we tuned them over a wide range of values, which we report in Appendix D.

Online learning. The OFO algorithm has also been evaluated in the online learning scenario in terms of the online F-measure (2). The goal of this experiment is to assess the convergence rate of OFO. Since the optimal F-measure is not known for the datasets, we used the test F-scores reported in Table 1 as reference values. The results are plotted in Figure 1 for four benchmark datasets (the plots for the remaining datasets can be found in Appendix G). As can be seen, the online F-score converges to the test F-score obtained in the one-pass evaluation in almost every case. There are some exceptions in the case of PEGASOS and PERCEPTRON. This might be explained by the fact that SVM-based methods, as well as the PERCEPTRON, tend to produce poor probability estimates in general (which is a main motivation for calibration methods turning output scores into valid probabilities [3]).

8 Conclusion and Future Work

This paper studied the problem of online F-measure optimization. Compared to many conventional online learning tasks, this is a specifically challenging problem, mainly because of the non-decomposable nature of the F-measure. We presented a simple algorithm that converges to the optimal F-score when the posterior estimates are provided by a sequence of classifiers whose L1 error converges to zero as fast as t^{−λ} for some λ > 0. As a key feature of our algorithm, we note that it is a purely online approach; moreover, unlike approaches such as 2S, there is no need for a hold-out validation set in batch mode.
Our promising results from extensive experiments validate the empirical efficacy of our algorithm.

For future work, we plan to extend our online optimization algorithm to a broader family of complex performance measures that can be expressed as ratios of linear combinations of true positive, false positive, false negative, and true negative rates [10]; the F-measure also belongs to this family. Moreover, going beyond consistency, we plan to analyze the rate of convergence of our OFO algorithm. This might be doable thanks to several nice properties of the function h(τ). Finally, an intriguing question is what can be said about the case where some bias is introduced because the classifier g_t does not converge to η.

Acknowledgments. Krzysztof Dembczyński is supported by the Polish National Science Centre under grant no. 2013/09/D/ST6/03917. The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 306638.

References

[1] E. Amigó, J. C. de Albornoz, I. Chugur, A. Corujo, J. Gonzalo, T. Martín-Wanton, E. Meij, M. de Rijke, and D. Spina. Overview of RepLab 2013: Evaluating online reputation monitoring systems. In CLEF, volume 8138, pages 333–352, 2013.

[2] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[3] R. Busa-Fekete, B. Kégl, T. Éltető, and Gy. Szarvas. Tune and mix: Learning to rank using ensembles of calibrated multi-class classifiers. Machine Learning, 93(2–3):261–292, 2013.

[4] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[5] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View.
Wiley, NY, 1985.

[6] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, NY, 1996.

[7] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, volume 30:3, pages 906–914, 2013.

[8] V. Hangya and R. Farkas. Filtering and polarity detection for reputation management on tweets. In Working Notes of CLEF 2013 Evaluation Labs and Workshop, 2013.

[9] P. Kar, H. Narasimhan, and P. Jain. Online and stochastic gradient methods for non-decomposable loss functions. In NIPS, 2014.

[10] N. Nagarajan, S. Koyejo, R. Ravikumar, and I. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, pages 2744–2752, 2014.

[11] H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NIPS, 2014.

[12] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.

[13] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[14] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pages 807–814, 2007.

[15] Y. Tsuruoka, J. Tsujii, and S. Ananiadou. Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. In ACL, pages 477–485, 2009.

[16] C. J. van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.

[17] S. R. S. Varadhan. Probability Theory. New York University, 2000.

[18] W. Waegeman, K. Dembczyński, A. Jachnik, W. Cheng, and E. Hüllermeier. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15(1):3333–3388, 2014.

[19] N. Ye, K. M. A. Chai, W. S. Lee, and H. L. Chieu.
Optimizing F-measure: A tale of two approaches. In ICML, 2012.

[20] M. Zhao, N. Edakunni, A. Pocock, and G. Brown. Beyond Fano's inequality: Bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. JMLR, pages 1033–1090, 2013.

[21] P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, pages 233–240, 2011.