{"title": "Confusion-Based Online Learning and a Passive-Aggressive Scheme", "book": "Advances in Neural Information Processing Systems", "page_first": 3284, "page_last": 3292, "abstract": "This paper provides the first ---to the best of our knowledge--- analysis of online learning algorithms for multiclass problems when the {\\em confusion} matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and more precisely to matrix martingales. We do establish generalization bounds for online learning algorithm and show how the theoretical study motivate the proposition of a new confusion-friendly learning procedure. This learning algorithm, called \\copa (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for \\copa can be computed analytically, thus allowing the user from having to recours to any optimization package to implement it.", "full_text": "Confusion-Based Online Learning and a\n\nPassive-Aggressive Scheme\n\nQARMA, Laboratoire d\u2019Informatique Fondamentale de Marseille\n\nLiva Ralaivola\n\nAix-Marseille University, France\n\nliva.ralaivola@lif.univ-mrs.fr\n\nAbstract\n\nThis paper provides the \ufb01rst \u2014to the best of our knowledge\u2014 analysis of online\nlearning algorithms for multiclass problems when the confusion matrix is taken\nas a performance measure. The work builds upon recent and elegant results on\nnoncommutative concentration inequalities, i.e. concentration inequalities that\napply to matrices, and, more precisely, to matrix martingales. 
We establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive), is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, hence, there is no need to resort to any optimization package to implement it.

1 Introduction

This paper aims at promoting a rarely adopted way to tackle multiclass prediction problems: we advocate for the use of the confusion matrix —the matrix which reports the probability of predicting class q for an instance of class p for all potential label pairs (p, q)— as the objective 'function' to be optimized. This way, we step away from the more widespread viewpoint of relying on the misclassification rate —the probability of misclassifying a point— as a performance measure for multiclass predictors. There are obvious reasons for taking this perspective, among which we may name the following. First, the confusion information is finer-grained information than the misclassification rate, as it allows one to precisely identify the types of errors made by a classifier. Second, the confusion matrix is independent of the class distributions, since it reports conditional probability distributions: a consequence is that a predictor learned to achieve a 'small' confusion matrix will probably be insensitive to class imbalance, and it will also be robust to changes in class prior distributions between train and test data. Finally, there are many application domains, such as medicine, bioinformatics and information retrieval, where the confusion matrix (or an estimate thereof) is precisely the object of interest for an expert who wants to assess the relevance of an automatic prediction procedure.

Contributions. 
We essentially provide two contributions. On the one hand, we provide a statistical analysis of the generalization ability of online learning algorithms producing predictors that aim at optimizing the confusion. This requires us to introduce relevant statistical quantities that are taken advantage of via a concentration inequality for matrix martingales proposed by [8]. Motivated by our statistical analysis, we propose an online learning algorithm from the family of passive-aggressive learning algorithms [2]: this algorithm is inherently designed to optimize the confusion matrix, and numerical simulations are provided that show it reaches its goal.

Outline of the paper. Section 2 formalizes our pursued objective of targeting a small confusion error. Section 3 provides our results regarding the ability of online confusion-aware learning procedures to achieve a small confusion, together with the update equations for COPA, a new passive-aggressive learning procedure designed to control the confusion risk. Section 4 reports numerical results that should be viewed as a proof of concept for the effectiveness of our algorithm.

2 Problem and Motivation

2.1 Notation

We focus on the problem of multiclass classification: the input space is denoted by X and the target space is Y = {1, . . . , Q}. The training sequence Z = {Zt = (Xt, Yt)}_{t=1}^T is made of T independent and identically distributed random pairs Zt = (Xt, Yt) distributed according to some unknown distribution D over X × Y. The sequence of input data will be referred to as X = {Xt}_{t=1}^T and the sequence of corresponding labels as Y = {Yt}_{t=1}^T. We may write that Z is distributed according to D^T = ⊗_{t=1}^T D. Z_{1:t} denotes the subsequence Z_{1:t} = {(Xτ, Yτ)}_{τ=1}^t. We use D_{X|y} for the conditional distribution on X given that Y = y; therefore, for a given sequence y = (y1, . . . 
, yT) ∈ Y^T, D_{X|y} = ⊗_{t=1}^T D_{X|y_t} is the distribution of the random sample X = {X1, . . . , XT} over X^T such that Xt is distributed according to D_{X|y_t}. E[·] and E_{X|y} respectively denote the expectation with respect to D and D_{X|y}.

For a sequence y of labels, T(y) = [T1(y) ··· TQ(y)] ∈ N^Q is such that Tq(y) is the number of times label q appears in y. Often, we will drop the dependence upon y for T(y). Throughout, we make the simplifying assumption that Tq > 1 for all q —otherwise, our analysis still holds but extra care and notation must be taken for handling classes absent from Z.

The space of hypotheses we consider is H (e.g. H ⊆ {f : f : X → R}^Q), and A designates an online learning algorithm that produces hypothesis ht ∈ H when it encounters a new example Zt.

Finally, ℓ = (ℓ_{q|p})_{1≤p,q≤Q} is a family of class-dependent loss functionals ℓ_{q|p} : H × X → R+. For a point x ∈ X of class y ∈ Y, ℓ_{q|y}(h, x) is the cost of h's favoring class q over y for x.

Example 1 (Misclassification Loss). The family of (cost-sensitive) misclassification losses ℓ^misclass is defined as

ℓ^misclass_{q|y}(h, x) := χ[h(x)=q] C_{yq},    (1)

where C_{pq} ∈ R+, ∀p, q ∈ Y, and χ[E] = 1 if E is true and 0 otherwise.

Example 2 (Hinge Loss). The family of multiclass hinge losses ℓ^hinge is such that, given W = {w1, . . . , wQ} ∈ X^Q with Σ_q wq = 0 and hypothesis hW such that hW(x) = [⟨w1, x⟩ ··· ⟨wQ, x⟩],

ℓ^hinge_{q|y}(hW, x) := |⟨wq, x⟩ + 1/(Q − 1)|_+,    (2)

where |·|_+ = max(0, ·).

2.2 From the Confusion Matrix to the Confusion Risk, and their Minimization

Confusion matrix. In many situations, e.g. class-imbalanced datasets, it is important not to measure the quality of a predictor h based upon its classification rate

R(h) := P_{XY}(h(X) ≠ Y)    (3)

only; this may lead to erroneous conclusions regarding the quality of h. Indeed, if, for instance, some class q is predominantly present in the data at hand, say P(Y = q) = 1 − ε, for some small ε > 0, then the predictor h_maj that always outputs h_maj(x) = q regardless of x has a classification error that is at most ε, whereas it never correctly predicts the class of data from classes p ≠ q. A more informative object is the confusion matrix C(h) ∈ R^{Q×Q} of h, which is defined as:

C(h) := (P(h(X) = q | Y = p))_{1≤p,q≤Q}.    (4)

The nondiagonal entries of C(h) provide the information as to the types of confusion, and their prevalence, h makes when predicting the class of some x. Let us now abuse notation and denote by C(h) the confusion matrix where the diagonal entries are zeroed. It is straightforward to see that, if π = [P(Y = 1) ··· P(Y = Q)] is the vector of class prior distributions, then the misclassification rate R(h) (cf. (3)) can be retrieved as

R(h) = ||π C(h)||_1,

where ||·||_1 denotes the 1-norm. This says that, with little additional information, the misclassification rate may be obtained from the confusion matrix, while the converse is not true.

It is clear that having the confusion matrix C(h) be zero means that h is a perfect predictor. When such a situation is not possible (if the data are corrupted by noise, for instance), a valuable objective might be to look for a classifier h having a confusion matrix as close to zero as possible. 
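Definition (4) and the identity R(h) = ||π C(h)||_1 can be checked on a small labelled sample. The following numpy sketch is ours, not the paper's: the helper `confusion_offdiag` and the toy labels are illustrative assumptions.

```python
import numpy as np

def confusion_offdiag(y_true, y_pred, Q):
    """Empirical estimate of C(h) in Eq. (4), with the diagonal zeroed.

    Entry (p, q), p != q, estimates P(h(X) = q | Y = p).
    """
    C = np.zeros((Q, Q))
    for p in range(Q):
        mask = y_true == p
        for q in range(Q):
            if p != q and mask.any():
                C[p, q] = np.mean(y_pred[mask] == q)
    return C

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 0, 2, 2])
C = confusion_offdiag(y_true, y_pred, Q=3)

# Recover the misclassification rate as R(h) = ||pi C(h)||_1,
# with pi the vector of empirical class priors.
pi = np.array([np.mean(y_true == p) for p in range(3)])
R = np.sum(np.abs(pi @ C))  # 1-norm of the row vector pi C(h)
```

On this toy sample the identity holds exactly because the priors π are estimated from the same labels used to build the confusion matrix.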
Here, we choose to measure the closeness to zero of matrices through the operator norm ||·||, defined as:

||M|| := max_{v≠0} ||Mv||_2 / ||v||_2,    (5)

where ||·||_2 is the standard Euclidean norm —||M|| is merely the largest singular value of M. In addition to being a valid norm, the operator norm has the following nice property, which will be of help to us. Given two matrices A = (a_ij) and B = (b_ij) of the same dimensions,

0 ≤ a_ij ≤ b_ij, ∀i, j  ⇒  ||A|| ≤ ||B||.    (6)

Given the equivalence between norms and the definition of the operator norm, it is easy to see that

R(h) ≤ √Q ||C(h)||,

and targeting a small confusion matrix for h may have the consequence of implying a small misclassification risk R(h).

The discussion conducted so far brings us to a natural goal of multiclass learning, that of minimizing the norm of the confusion matrix, i.e. that of solving the following optimization problem:

min_{h∈H} ||C(h)||.

However, addressing this question directly poses a number of problems, both from the statistical and algorithmic sides: a) it is not possible to compute C(h), as it depends on the unknown distribution D, and b) relying on empirical estimates of C(h), as is customary in statistical learning, requires dealing with the indicator function χ[·] that appears in (1), which is not optimization-friendly.

Confusion Risk. In order to deal with the latter problem, and to prepare the ground for tackling the former one from a theoretical point of view, we now introduce and discuss the confusion risk, which is parameterized by a family of loss functions that may serve as surrogates for the indicator loss χ[·].

Definition 1 (Confusion Risk). 
The confusion risk C_ℓ(h) of h is defined as

C_ℓ(h) = (c^ℓ_pq)_{1≤p,q≤Q} ∈ R^{Q×Q}, with c^ℓ_pq := E_{X|p} ℓ_{q|p}(h, X) if p ≠ q, and 0 otherwise.    (7)

Observe that if the family ℓ^misclass of losses from Example 1 is retained, and C_pq = 1 for all p, q, then C_{ℓ^misclass}(h) is precisely the confusion matrix discussed above (with the diagonal set to zero).

Similarly as before, the ℓ-risk R_ℓ(h) is defined as R_ℓ(h) := ||π C_ℓ(h)||_1, and R_ℓ(h) ≤ √Q ||C_ℓ(h)||. The following lemma directly comes from Equation (6).

Lemma 1. Let h ∈ H. If 0 ≤ χ[h(x)≠p] ≤ ℓ_{q|p}(h, x), ∀x ∈ X, ∀p, q ∈ Y, then ||C(h)|| ≤ ||C_ℓ(h)||.

This says that if we resort to a surrogate ℓ that upper bounds the 0-1 indicator function, then the norm of the confusion risk is always larger than the norm of the confusion matrix. Armed with the confusion risk and the last lemma, we may now turn towards the legitimate objective

min_{h∈H} ||C_ℓ(h)||,

a small value of ||C_ℓ(h)|| implying a small value of ||C(h)|| (which was our primary goal).

It is still impossible to solve this problem because of the unknown distribution D according to which expectations are computed. However, as already evoked, it is possible to derive learning strategies, based on empirical estimates of the confusion risk, that ensure ||C_ℓ(h)|| will be small. 
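Property (6) and the norm comparisons above are easy to probe numerically. In the sketch below, the random nonnegative matrices A and B are illustrative stand-ins (our assumption, not the paper's construction) for a diagonal-zeroed confusion matrix and an entrywise-dominating surrogate confusion risk, as in Lemma 1.

```python
import numpy as np

def opnorm(M):
    """Operator norm ||M|| of Eq. (5): the largest singular value of M."""
    return np.linalg.svd(M, compute_uv=False)[0]

rng = np.random.default_rng(0)
Q = 4
# A mimics a diagonal-zeroed confusion matrix C(h); B >= A entrywise
# mimics a surrogate confusion risk C_ell(h), as in Lemma 1.
A = rng.uniform(0.0, 0.5, size=(Q, Q))
B = A + rng.uniform(0.0, 0.5, size=(Q, Q))
np.fill_diagonal(A, 0.0)
np.fill_diagonal(B, 0.0)

# Entrywise 0 <= A <= B implies ||A|| <= ||B||  (property (6)).
dominated = opnorm(A) <= opnorm(B) + 1e-12

# R(h) = ||pi A||_1 <= sqrt(Q) ||A|| for any prior vector pi (Section 2.2).
pi = rng.dirichlet(np.ones(Q))
R = np.sum(np.abs(pi @ A))
bounded = R <= np.sqrt(Q) * opnorm(A) + 1e-12
```

Both checks hold for any nonnegative matrices of matching shape, not just this random draw.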
The next section is devoted to showing how this can be done.

3 Bounds on the Confusion Risk via Online Learning and COPA

3.1 (Empirical) Online Confusion Risk

Assume an online learning algorithm A that outputs hypotheses from a family H: run on the training sequence Z ∼ D^T, A outputs hypotheses h = {h0, . . . , hT}, where h0 is an arbitrary hypothesis.

Definition 2 (L̂_|p(·,·) and L_|p(·) matrices). Let ℓ = (ℓ_{q|p}) be a family of losses and let p ∈ Y. For h ∈ H and x ∈ X, we define

L̂_|p(h, x) = (l̂_uv)_{1≤u,v≤Q} ∈ R^{Q×Q}, with l̂_uv := ℓ_{v|u}(h, x) if u = p and v ≠ u, and 0 otherwise.    (8)

The matrix L_|p(h) ∈ R^{Q×Q} is naturally given by

L_|p(h) = E_{X|p} L̂_|p(h, X).    (9)

Remark 1. Note that the only row that may be nonzero in L̂_|p(h, x) and L_|p(h) is the p-th row.

Definition 3 ((Conditional) Online Confusion Risk). Let y ∈ Y^T be a fixed sequence of labels and Y a random sequence of T labels. Let h = {h0, . . . , hT−1} be a sequence of T hypotheses. The conditional and non-conditional confusion risks C_{ℓ,y}(h) and C_{ℓ,Y}(h) are defined as

C_{ℓ,y}(h) := Σ_{t=1}^T (1/T_{y_t}) L_{|y_t}(h_{t−1}),  and  C_{ℓ,Y}(h) := Σ_{t=1}^T (1/T_{Y_t}) L_{|Y_t}(h_{t−1}).    (10)

Remark 2. C_{ℓ,Y}(h) is a random variable. In order to provide our generalization bounds, it will be more convenient for us to work with the conditional confusion C_{ℓ,y}(h). A simple argument will then allow us to provide generalization bounds for C_{ℓ,Y}(h).

Definition 4. Let y ∈ Y^T be a fixed sequence of labels. Let h = {h0, . . . , hT−1} be the hypotheses output by A when run on Z_y = {(Xt, yt)}_{t=1}^T, such that Xt is distributed according to D_{X|y_t}. The (non-)conditional empirical online confusion risks Ĉ_{ℓ,y}(h, X) and Ĉ_{ℓ,Y}(h, X) are

Ĉ_{ℓ,y}(h, X) := Σ_{t=1}^T (1/T_{y_t}) L̂_{|y_t}(h_{t−1}, Xt),  and  Ĉ_{ℓ,Y}(h, X) := Σ_{t=1}^T (1/T_{Y_t}) L̂_{|Y_t}(h_{t−1}, Xt).    (11)

We are now almost ready to provide our results. We just need to introduce a pivotal result that is a corollary of Theorem 7.1 from [8], a proof of which can be found in the appendix.

Corollary 1 (Rectangular Matrix Azuma). Consider a sequence {Uk} of d1 × d2 random matrices, and a fixed sequence of scalars {Mk} such that

E_{Uk|U1...Uk−1} Uk = 0, and ||Uk|| ≤ Mk almost surely.

Then, for all t > 0,

P{ ||Σ_k Uk|| ≥ t } ≤ (d1 + d2) exp( −t² / (2σ²) ),  with σ² := Σ_k M²_k.

3.2 New Results

This subsection reports our main results.

Theorem 1. Suppose that the losses in ℓ are such that 0 ≤ ℓ_{q|p} ≤ M for some M > 0. Fix y ∈ Y^T. For any δ ∈ (0, 1], it holds with probability 1 − δ over the draw of X ∼ D_{X|y} that

||C_{ℓ,y}(h) − Ĉ_{ℓ,y}(h, X)|| ≤ M √( 2Q Σ_p (1/T_p) log(Q/δ) ),    (12)

where h = {h0, . . . 
, hT−1} is the set of hypotheses output by A when provided with {(Xt, yt)}_{t=1}^T.

Therefore, with probability at least 1 − δ,

||C_{ℓ,y}(h)|| ≤ ||Ĉ_{ℓ,y}(h, X)|| + M √( 2Q Σ_p (1/T_p) log(Q/δ) ).    (13)

Proof. The proof is straightforward using Corollary 1. Indeed, consider the random variable

U_t := (1/T_{y_t}) L_{|y_t}(h_{t−1}) − (1/T_{y_t}) L̂_{|y_t}(h_{t−1}, X_t).

On the one hand, we observe that

E_{X_t|X_{1:t−1},y} U_t = E_{X_t|X_{1:t−1},y_{1:t}} U_t = 0,

since E_{X_t|X_{1:t−1},y_{1:t}} L̂_{|y_t}(h_{t−1}, X_t) = L_{|y_t}(h_{t−1}). On the other hand, introducing

Δ_{t,q} := E_{X_t|y_t} ℓ_{q|y_t}(h_{t−1}, X_t) − ℓ_{q|y_t}(h_{t−1}, X_t),

we observe that

||U_t|| = sup_{v:||v||≤1} ||U_t v||_2 = (1/T_{y_t}) sup_{v:||v||≤1} | Σ_{q≠y_t} Δ_{t,q} v_q | = (1/T_{y_t}) √( Σ_{q≠y_t} Δ²_{t,q} ) ≤ (M/T_{y_t}) √Q,

where we used that the only row of U_t not to be zero is its y_t-th row (see Remark 1), the fact that sup_{v:||v||≤1} v · u = ||u||_2, and the assumption 0 ≤ ℓ_{q|p} ≤ M, which gives |Δ_{t,q}| ≤ M.

Using Corollary 1 on the matrix martingale {U_t}, where ||U_t|| ≤ M√Q/T_{y_t} almost surely, gives

P{ ||Σ_t U_t|| ≥ ε } ≤ 2Q exp( − ε² / (2M²Q Σ_t 1/T²_{y_t}) ).

Setting the right-hand side to δ gives that, with probability at least 1 − δ,

||Σ_t U_t|| ≤ M √( 2Q Σ_t (1/T²_{y_t}) ln(2Q/δ) ).

Noting that Σ_t T_{y_t}^{−2} = Σ_p T_p^{−1} gives (12). The triangle inequality | ||A|| − ||B|| | ≤ ||A − B|| gives (13).

If one takes a little step back to fully understand Theorem 1, it may not be as rejoicing as expected. Indeed, it provides a bound on the norm of the average confusion risk of hypotheses h0, . . . , hT−1, which, from a practical point of view, does not say much about the norm of the confusion risk of a specific hypothesis. In fact, as is usual in online learning [1], it provides a bound on the confusion risk of the Gibbs classifier, which uniformly samples over h0, . . . , hT−1 before outputting a prediction. Just as emphasized by [1], things may turn a little more enticing when the loss functions ℓ are convex with respect to their first argument, i.e.

∀h, h′ ∈ H, ∀p, q ∈ Y, ∀λ ∈ [0, 1],  ℓ_{q|p}(λh + (1 − λ)h′, x) ≤ λ ℓ_{q|p}(h, x) + (1 − λ) ℓ_{q|p}(h′, x).    (14)

In that case, we may show the following theorem, which is much more compatible with the stated goal of trying to find a hypothesis that has a small (or at least, controlled) confusion risk.

Theorem 2. In addition to the assumptions of Theorem 1, we now assume that ℓ is made of convex losses (as defined in (14)). For any δ ∈ (0, 1], it holds with probability 1 − δ over the draw of X ∼ D_{X|y} that

||C_ℓ(h̄)|| ≤ ||Ĉ_{ℓ,y}(h, X)|| + M √( 2Q Σ_p (1/T_p) log(Q/δ) ),  with h̄ := (1/T) Σ_{t=1}^T h_{t−1}.    (15)

Proof. 
It is a direct consequence of the convexity of ℓ combined with Equation (6).

It is now time to give the argument allowing us to state results for the non-conditional online confusion risks. If a random event E(A, B) defined with respect to random variables A and B is such that P_{A|B=b}(E(A, b)) ≥ 1 − δ for all possible values of b, then P_{AB}(E(A, B)) = Σ_b P_{A|B=b}(E(A, b)) P_B(B = b) ≥ Σ_b (1 − δ) P_B(B = b) = 1 − δ. The results of Theorem 1 and Theorem 2 may therefore be stated in terms of Y instead of y.

In light of the generic results of this subsection, we are now ready to motivate and derive a new online learning algorithm that aims at a small confusion risk.

3.3 Online Learning with COPA

This subsection presents a new online algorithm, COPA (for COnfusion Passive-Aggressive learning). Before giving the full detail of our algorithm, we further discuss the implications of Theorem 2. A first message from Theorem 2 is that, provided the functions ℓ considered are convex, it is relevant to use the average hypothesis h̄ as a predictor. We indeed know by (15) that the confusion risk of h̄ is bounded by ||Ĉ_{ℓ,y}(h, X)||, plus some quantity directly related to the number of training data encountered. The second message from the theorem is that the focus naturally comes to ||Ĉ_{ℓ,y}(h, X)|| and the question of how to minimize this quantity.

According to Definition 4, Ĉ_{ℓ,y}(h, X) is the sum of instantaneous confusion matrices L̂_{|y_t}(h_{t−1}, X_t)/T_{y_t}. In light of (6), it does make sense to try to minimize each entry of L̂_{|y_t}(h, X_t)/T_{y_t} with respect to h to get h_t, with the hope that the instantaneous risk of h_t on X_{t+1} will be small: one may want to minimize the norm of L̂_{|y_t}(h, X_t)/T_{y_t} and pose a problem like the following:

min_h || L̂_{|y_t}(h, X_t) / T_{y_t} ||.

However, as remarked before, L̂_{|y_t} has only one nonzero row, its y_t-th. 
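The single-nonzero-row structure of L̂_{|y}(h, x) can be illustrated with the hinge family of Example 2; in the sketch below the weight matrix is a random placeholder (our assumption), with the sum-to-zero constraint enforced by centering.

```python
import numpy as np

def lhat_row(W, x, y):
    """Nonzero (y-th) row of L_hat_{|y}(h_W, x) for the hinge family of
    Example 2: entries ell_{q|y}(h_W, x) = |<w_q, x> + 1/(Q-1)|_+ for q != y."""
    Q = W.shape[0]
    row = np.maximum(W @ x + 1.0 / (Q - 1), 0.0)
    row[y] = 0.0  # the diagonal entry (q = y) carries no loss
    return row

rng = np.random.default_rng(1)
Q, d = 3, 5
W = rng.normal(size=(Q, d))
W -= W.mean(axis=0)            # enforce sum_q w_q = 0
x = rng.normal(size=d)
y, T_y = 0, 10

L_hat = np.zeros((Q, Q))
L_hat[y] = lhat_row(W, x, y)   # only the y-th row is nonzero

# The operator norm of L_hat / T_y equals the Euclidean norm of its
# y-th row, whose square is the per-step objective of Eq. (16).
op = np.linalg.svd(L_hat / T_y, compute_uv=False)[0]
row_norm = np.linalg.norm(L_hat[y]) / T_y
```

For a rank-one matrix with a single nonzero row, the largest singular value coincides with that row's Euclidean norm, which is what the next paragraph exploits.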
Therefore, the operator norm of L̂_{|y_t}(h, X_t)/T_{y_t} is simply the Euclidean norm of its y_t-th row. Trying to minimize the squared Euclidean norm of this row might be written as

min_h (1/T²_{y_t}) Σ_{q≠y_t} ℓ²_{q|y_t}(h, X_t).    (16)

This last equation is the starting point of COPA. To see how the connection is made, we make some instantiations. The hypothesis space H is made of linear classifiers, so that a sequence of vectors W = {w1, . . . , wQ} with Σ_q wq = 0 defines a hypothesis hW. The family ℓ COPA builds upon is

ℓ_{q|p}(hW, x) = |⟨wq, x⟩ + 1/(Q − 1)|_+,  ∀p, q ∈ Y.

In other words, COPA is an instance of Example 2. We focus on this setting because it is known that, in the batch scenario, it provides Bayes-consistent classifiers [3, 4]. Given a current set of vectors Wt = {w^t_1, . . . , w^t_Q}, using (16), and implementing a passive-aggressive learning strategy [2], the new set of vectors W_{t+1} is searched as the solution of

min_{W, Σ_q wq=0} (1/2) Σ_{q=1}^Q ||wq − w^t_q||²_2 + (C/(2T²_y)) Σ_{q≠y} |⟨wq, x⟩ + 1/(Q − 1)|²_+.    (17)

It turns out that it is possible to find the solution of this optimization problem without having to resort to any involved optimization procedure. This is akin to a result obtained by [6], which applies to another family of loss functions. We indeed prove the following theorem (proof in the supplementary material).

Theorem 3 (COPA update). Suppose we want to solve

min_{W, Σ_q wq=0} (1/2) Σ_{q=1}^Q ||wq − w^t_q||²_2 + (C/2) Σ_{q≠y} |⟨wq, x⟩ + 1/(Q − 1)|²_+.    (18)

Let ℓ^t_q be defined as

ℓ^t_q := ⟨w^t_q, x⟩ + 1/(Q − 1).    (19)

Let σ be a permutation defined over {1, . . . , Q − 1} taking values in Y\{y} such that ℓ^t_{σ(1)} ≥ . . . ≥ ℓ^t_{σ(Q−1)}. Let I* be the largest index I ∈ {1, . . . , Q − 1} such that

ℓ^t_{σ(I)} + (||x||² / (κQ − (I − 1)||x||²)) Σ_{q=1}^{I−1} ℓ^t_{σ(q)} > 0,  with κ := 1/C + ||x||².

If I* is set to I* := {σ(1), . . . , σ(I*)}, then we may define α* = [α*_1 ··· α*_Q] as

α*_q := (1/κ)( ℓ^t_q + (||x||²/Q) s_α(I*) ) if q ∈ I*, and 0 otherwise,  where s_α(I*) := (Q / (κQ − I*||x||²)) Σ_{q∈I*} ℓ^t_q,    (20)

and the vectors

w*_q := w^t_q − ( α*_q − (1/Q) Σ_{p∈I*} α*_p ) x,  q = 1, . . . , Q,    (21)

are the solution of optimization problem (18). These equations provide COPA's update procedure.

The full COPA algorithm is depicted in Algorithm 1.

Algorithm 1 COPA
Input: z = {(xt, yt)}_{t=1}^T training set (realization of Z), R number of epochs over z, C cost
Output: W = {w1, . . . , wQ}, the classification vectors
  τ = 0; w^0_1 = . . . = w^0_Q
  for r = 1 to R do
    for t = 1 to T do
      receive (xt, yt)
      compute α* according to (20)
      ∀q, perform the update: w^{τ+1}_q ← w^τ_q − ( α*_q − (1/Q) Σ_{p∈I*} α*_p ) xt
      τ ← τ + 1
    end for
  end for
  ∀q, wq ← (1/τ) Σ_{k=1}^τ w^k_q

4 Numerical Simulations

We here report results of preliminary simulations of COPA on a toy dataset. We generate a set of 5000 samples according to three Gaussian distributions, each of variance σ²I with σ = 0.5. One of the Gaussians is centered at (0, 1), another at (1, 0) and the last one at (−1, 0). The respective weights of the Gaussian distributions are 0.9, 0.05 and 0.05. The first generated sample is used to choose the parameter C of COPA with a half split of the data between train and test; 10 other samples of size 5000 are generated and split as 2500 for learning and 2500 for testing, and the results are averaged over the 10 samples. We compare the results of COPA to those of a simple multiclass perceptron procedure (the number of epochs for each algorithm is set to 5). As recommended by the theory, we average the classification vectors to get our predictors (both for COPA and the perceptron). The essential finding of these simulations is that COPA and the perceptron achieve similar classification accuracies, 0.85 and 0.86 respectively, but the norm of the confusion matrix of COPA is 0.10 and that of the Perceptron is 0.18. 
This means COPA indeed does its job in trying to minimize the confusion risk.

5 Conclusion

In this paper, we have provided new bounds for online learning algorithms aiming at controlling their confusion risk. To the best of our knowledge, these results are the first of this kind. Motivated by the theoretical results, we have proposed a passive-aggressive learning algorithm, COPA, which has the appealing property that its updates can be computed easily, without having to resort to any optimization package. Preliminary numerical simulations tend to support the effectiveness of COPA.

In addition to complementary numerical simulations, we want to pursue our work in several directions. First, we would like to know whether efficient update equations can be derived if a simple hinge, instead of a squared hinge, is used. Second, we would like to see if a full regret-style analysis can be made to study the properties of COPA. Also, we are interested in comparing the merits of our theoretical results with those recently proposed in [5] and [7], which address learning with the confusion matrix from the algorithmic stability and PAC-Bayesian viewpoints, respectively. Finally, we would like to see how the proposed material can be of some use in the realm of structured prediction and, by extension, in the case where the confusion matrix to consider is infinite-dimensional.

Acknowledgments. Work partially supported by Pascal 2 NoE ICT-216886-NOE, French ANR Projects ASAP (ANR-09-DEFIS-001) and GRETA (ANR-12-BS02-0004).

Appendix

Theorem 4 (Matrix Azuma-Hoeffding, [8]). 
Consider a finite sequence {Xk} of self-adjoint matrices in dimension d, and a fixed sequence {Ak} of self-adjoint matrices that satisfy E_{k−1} Xk = 0 and

X²_k ≼ A²_k, and Ak Xk = Xk Ak, almost surely.

Then, for all t ≥ 0,

P{ λ_max( Σ_k Xk ) ≥ t } ≤ d · e^{−t²/(2σ²)},  with σ² = || Σ_k A²_k ||.

Proof of Corollary 1. To prove the result, it suffices to make use of the dilation technique and apply Theorem 4. The self-adjoint dilation D(U) of a matrix U ∈ R^{d1×d2} is the self-adjoint matrix D(U) of order d1 + d2 defined by

D(U) = [ 0  U ; U*  0 ],

where U* is the adjoint of U (as U has only real coefficients here, U* is the transpose of U). As recalled in [8], ||D(U)|| = ||U|| and, therefore, the largest eigenvalue λ_max of D²(U) is equal to ||U||², and D²(U) ≼ ||U||² I.

Considering Uk, we get that, almost surely,

D²(Uk) ≼ M²_k I,

and since dilation is a linear operator, we have that

E_{Uk|U1···Uk−1} D(Uk) = 0.

The sequence of matrices {D(Uk)} is therefore a matrix martingale that verifies the assumptions of Theorem 4, the application of which gives

P{ λ_max( Σ_k D(Uk) ) ≥ t } ≤ (d1 + d2) exp( −t²/(2σ²) ),

with σ² = Σ_k M²_k. Thanks to the linearity of D,

λ_max( Σ_k D(Uk) ) = λ_max( D( Σ_k Uk ) ) = || Σ_k Uk ||,

which closes the proof.

References

[1] N. Cesa-Bianchi, A. Conconi, and C. Gentile. 
On the generalization ability of online learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[2] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

[3] Y. Lee. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Technical report, University of Wisconsin, 2002.

[4] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Journal of the American Statistical Association, 99:67–81, March 2004.

[5] P. Machart and L. Ralaivola. Confusion matrix stability bounds for multiclass classification. Technical report, Aix-Marseille University, 2012.

[6] S. Matsushima, N. Shimizu, K. Yoshida, T. Ninomiya, and H. Nakagawa. Exact passive-aggressive algorithm for multiclass classification using support class. In SDM 10, pages 303–314, 2010.

[7] E. Morvant, S. Koço, and L. Ralaivola. PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification. In John Langford and Joelle Pineau, editors, International Conference on Machine Learning, pages 815–822, Edinburgh, United Kingdom, 2012.

[8] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, pages 1–46, 2011.
", "award": [], "sourceid": 1515, "authors": [{"given_name": "Liva", "family_name": "Ralaivola", "institution": null}]}