{"title": "Necessary Intransitive Likelihood-Ratio Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 537, "page_last": 544, "abstract": "", "full_text": "Necessary Intransitive Likelihood-Ratio Classifiers\n\nGang Ji and Jeff Bilmes\nSSLI-Lab, Department of Electrical Engineering\nUniversity of Washington\nSeattle, WA 98195-2500\n{gang,bilmes}@ee.washington.edu\n\nAbstract\n\nIn pattern classification tasks, errors are introduced because of differences between the true model and the one obtained via model estimation. Using likelihood-ratio based classification, it is possible to correct for this discrepancy by finding class-pair specific terms to adjust the likelihood ratio directly, and that can make class-pair preference relationships intransitive. In this work, we introduce new methodology that makes necessary corrections to the likelihood ratio, specifically those that are necessary to achieve perfect classification (but not perfect likelihood-ratio correction, which can be overkill). The new corrections, while weaker than previously reported such adjustments, are analytically challenging since they involve discontinuous functions, therefore requiring several approximations. We test a number of these new schemes on an isolated-word speech recognition task as well as on the UCI machine learning data sets. Results show that by using the bias terms calculated in this new way, classification accuracy can substantially improve over both the baseline and over our previous results.\n\n1 Introduction\n\nStatistical pattern recognition is often based on Bayes decision theory [4], which aims to achieve minimum error rate classification. In previous work [2], we observed that multi-class Bayes classification can be viewed as a tournament-style game, where the winner between players is decided using log likelihood ratios. 
Supposing the classes (players) are {c_1, c_2, ..., c_M}, and the observation (game) is x, the winner of each pair of classes is determined, with the assumption of equal priors, by the sign of the log likelihood ratio L_ij(x) = ln[P(x|c_i)/P(x|c_j)]: if L_ij > 0 class c_i wins and otherwise class c_j wins. A practical game strategy can be obtained by fixing a comparison order, {i_1, i_2, ..., i_M}, as a permutation of {1, 2, ..., M}, where class c_{i_1} plays with class c_{i_2}, the winner plays with class c_{i_3}, and so on until a final winner is ultimately found. This yields a transitive game [8]: assuming no ties, the ultimate winner is identical regardless of the comparison order.\n\nTo perform these procedures optimally, correct likelihood ratios are needed, which requires correct probabilistic models and sufficient training data. This is never the case given a finite amount of training data or the wrong model family, typical in practice. In previous work [2], we introduced a method to correct for the difference between the true and an approximate log likelihood ratio. In this work, we improve upon the correction method by using an expression that can still lead to perfect correction, but is weaker than what we used before. We show that this new condition can achieve a significant improvement over baseline results, both on a medium-vocabulary isolated-word automatic speech recognition task and on the UCI machine learning data sets. The paper is organized as follows: Section 2 describes the general scheme and describes past work. Section 3 discusses the weaker correction condition and its approximations. Section 4 provides various experimental results on an isolated-word speech recognition task. Section 5 contains the experimental results on the UCI data. 
Finally, Section 6 concludes.\n\n2 Background\n\nA common problem in many probabilistic machine learning settings is the lack of a correct statistical model. In a generative pattern classification setting, this occurs because only an estimated quantity P̂(x|c)¹ of a distribution is available, rather than the true class-conditional model P(x|c). In the likelihood ratio decision scheme described above, only an imperfect log likelihood ratio, L̂_ij(x) = ln(P̂(x|c_i)/P̂(x|c_j)), is available for decision making rather than the true log likelihood ratio L_ij(x).\n\nOne approach to correct for this inaccuracy is to use richer class conditional likelihoods, more complicated parametric forms of L_ij(x) itself, and/or more training data. In previous work [2], we proposed a different approach that requires no change in generative models, no increase in free parameters, and no additional training data but still yields improved accuracy. The key idea is to compensate for the difference between L_ij(x) and L̂_ij(x) using a bias² term α_ij(x) computed from test data such that:\n\nL_ij(x) − α_ij(x) = L̂_ij(x).  (1)\n\nIf it is assumed that a single bias term is used for all data, so that α_ij(x) = α_ij, we found that the best α_ij is as follows:\n\nα_ij = (1/2)(D(i‖j) − D(j‖i)) − (1/2)(D̂(i‖j) − D̂(j‖i)),  (2)\n\nwhere D(i‖j) = E_{P(x|c_i)} L_ij(x) is the Kullback-Leibler (KL) divergence [3] between P(x|c_i) and P(x|c_j) and D̂(i‖j) = E_{P(x|c_i)} L̂_ij(x) is its estimate. Under the assumption (referred to as assumption A in Section 3.1) of symmetric KL-divergence for the true model (e.g., equal covariance matrices in the Gaussian case), the bias term can be solved explicitly as\n\nα_ij = −(1/2)(D̂(i‖j) − D̂(j‖i)).  (3)\n\nWe saw how the augmented likelihood ratio S_ij(x) = L̂_ij(x) + α_ij can lead to an intransitive game [8, 13], since S_ij(x) can specify intransitive preferences amongst the set {1, 2, ..., M}. We therefore investigated a number of intransitive game playing strategies. Moreover, we observed that if the correction was optimal, the true likelihood ratios would be obtained, which are clearly transitive. We therefore hypothesized and experimentally verified that the existence of intransitivity was a good indicator of the occurrence of a classification error.\n\nThis general approach can be improved upon in several ways. First, better intransitive strategies can be developed (for detecting, tolerating, and utilizing the intransitivity of a classifier); second, the assumption of symmetric KL-divergence could be relaxed; and third, the above criterion is stricter than required to obtain perfect correction. In this work, we advance on the latter two of the above three possible avenues for improvement.\n\n¹In this paper, we use “hatted” letters to describe estimated quantities.\n²Note that by bias, we do not mean standard parameter bias in statistical parameter estimation.\n\n3 Necessary Intransitive Scheme\n\nAn α_ij(x) that solves Equation 1 is a sufficient condition for a perfect correction of the estimated likelihood ratio since, given such a quantity, the true likelihood ratio would be attainable. 
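As a concrete illustration (a sketch, not code from the paper; the function name and sample values are hypothetical, with the D̂ terms estimated as sample means of first-pass log likelihood ratios), the Equation 3 bias could be computed as:

```python
def alpha_kld(Lhat_ij_Ci, Lhat_ij_Cj):
    """Sketch of Equation 3: alpha_ij = -(1/2)(Dhat(i||j) - Dhat(j||i)),
    estimating Dhat(i||j) by the mean of Lhat_ij(x) over samples hypothesized
    as class i, and Dhat(j||i) by the mean of Lhat_ji(x) = -Lhat_ij(x) over
    samples hypothesized as class j."""
    D_ij = sum(Lhat_ij_Ci) / len(Lhat_ij_Ci)
    D_ji = sum(-L for L in Lhat_ij_Cj) / len(Lhat_ij_Cj)
    return -0.5 * (D_ij - D_ji)

# Illustrative first-pass log-likelihood-ratio values:
print(alpha_kld([2.0, 1.0], [-1.0, -3.0]))  # -> 0.25
```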
This condition, however, is stricter than required because it is only the sign of the likelihood ratio that is needed to decide the winning class. We therefore should ask for a condition that corrects only for the discrepancy in sign between the true and estimated ratio, i.e., we want to find a function α_ij(x) that minimizes\n\nJ[α_ij] = ∫_{R^n} {sgn[L_ij(x) − α_ij(x)] − sgn L̂_ij(x)}² · P_ij(x) dx.\n\nClearly the α_ij(x) that minimizes J[α_ij] is the one such that\n\nsgn[L_ij(x) − α_ij(x)] = sgn L̂_ij(x),  ∀x ∈ supp P_ij = {x : P_ij(x) ≠ 0}.  (4)\n\nAs can be seen, this condition is weaker than Equation 1, weaker in the sense that any solution to Equation 1 solves Equation 4 but not vice versa. Note also that Equation 4 provides necessary conditions for an additive bias term to achieve perfect correction, since any such correction must achieve parity in the sign. Therefore, it might make it simpler to find a better bias term since Equation 4 (and therefore the set of possible α values) is less constrained. As will be seen, however, analysis of this weaker condition is more difficult. In the following sections, therefore, we introduce several approximations to this condition. Note that, as in previous work, we henceforth assume α_ij(x) = α_ij is a constant. In this case, the equation providing the best α_ij values is:\n\nE_{P_ij}{sgn[L_ij(x) − α_ij]} = E_{P_ij}{sgn L̂_ij(x)}.  (5)\n\n3.1 The difficulty with the sign function\n\nThe main problem in trying to solve for α_ij in Equation 5 is the existence of a discontinuous function. In this section, therefore, we work towards obtaining an analytically tractable approximation. The {−1, 0, 1}-valued sign function sgn(z) is defined as 2u(z) − 1, where u(z) is the Heaviside step function. 
We obtain an approximation via a Taylor expansion as follows:\n\nsgn(z + ε) = sgn(z) + ε sgn′(z) + o(ε) = sgn(z) + 2εδ(z) + o(ε),  (6)\n\nwhere δ(z) is the Dirac delta function [7]. It can be defined as the derivative of the Heaviside step function, u′(z) = δ(z), and it satisfies the sifting property ∫_R f(z)δ(z − z_0) dz = f(z_0). Therefore, it follows that [6, page 263]\n\n∫_{R^n} f(z)δ[g(z)] dz = ∫_{Z_g} f(z)/|∇g(z)| · dμ,\n\nwhere ∇g is the gradient of g and Z_g = {z ∈ R^n : g(z) = 0} is the zero set of g with Lebesgue measure μ [12].\n\nOf course, the Taylor expansion is valid only for a differentiable function; otherwise the error terms can be arbitrarily large. If, however, we find and use a suitable continuous and differentiable approximation rather than the discrete sign function, the above expansion becomes more appropriate. There exists a trade-off, however, between the quality of the sign function approximation (a better sign function should yield a better approximation in Equation 4) and the error caused by the o(ε) term in Equation 6 (a better sign function approximation will have a greater error when the higher-order Taylor terms are dropped). We therefore expect that ideally there will exist an optimal balance between the two. The shifted sigmoid with free parameter β (defined and used below) allows us to easily explore this trade-off simply by varying β.\n\nRetaining the first-order Taylor term, and applying this to the left side of Equation 5,\n\nE_{P_ij} sgn[L_ij(x) − α_ij] ≈ E_{P_ij} sgn L_ij(x) − 2α_ij E_{P_ij} δ[L_ij(x)].\n\nThe distribution under which the expectation in Equation 5 is taken can also influence our results. 
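Numerically, the behavior of the shifted-sigmoid approximation 2/(1 + e^{−βz}) − 1 used below can be checked directly (a small sketch; the test points are arbitrary):

```python
import math

def soft_sign(z, beta):
    # Shifted-sigmoid approximation of the sign function: 2*sigmoid(beta*z) - 1.
    return 2.0 / (1.0 + math.exp(-beta * z)) - 1.0

# Away from the decision boundary the approximation sharpens as beta grows,
# while at z = 0 it matches sgn(0) = 0 exactly for every beta.
for beta in (1, 10, 100):
    print(beta, round(soft_sign(0.5, beta), 4))
print(soft_sign(0.0, 1000))  # -> 0.0
```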
If it is known that the true class of x is always c_i, the c_i-conditional distribution should be used, i.e., P_ij(x) = P(x|c_i), yielding a class-conditional correction term α^(i)_ij, and a class-conditional likelihood-ratio correction S^(i)_ij(x) = L̂_ij(x) + α^(i)_ij. The symmetric case arises when x is of class c_j. If, on the other hand, neither c_i nor c_j is the true class (i.e., x is sampled from some other class-conditional distribution, say P(x|c_k), k ≠ i, j), it does not matter which distribution for P_ij(x) is used since, for a given comparison order in a game playing strategy, the current winner will ultimately play using the true class distribution P(x|c_k) of x (when one of i or j will equal k). It is therefore valid to consider only the case when either x is of class c_i (we denote this event by C_i(x)) or when x is of class c_j (event C_j(x)). Note that these two events are disjoint.\n\nIn practice, however, we do not know which of the two events is correct. The ideal choice in either case can be expressed using indicators as follows:\n\nA_ij(x) = α^(i)_ij 1{C_i(x)} + α^(j)_ij 1{C_j(x)}.\n\nTaking the expected value of A_ij(X) with respect to p(x|C_i(x) ∨ C_j(x)) yields\n\nα_ij = E_{p(x|C_i(x) ∨ C_j(x))}[A_ij(X)] = [α^(i)_ij P(c_i) + α^(j)_ij P(c_j)] / [P(c_i) + P(c_j)].\n\nThis results in a single likelihood correction S_ij(x) = L̂_ij(x) + α_ij that is obtained simply by integrating in Equation 5 with respect to the average distribution over classes c_i and c_j, i.e.,\n\nP_ij(x) ≜ p(x|C_i(x) ∨ C_j(x)) = [P(c_i)P(x|c_i) + P(c_j)P(x|c_j)] / [P(c_i) + P(c_j)].\n\nWith these assumptions, and supposing the zero set Z_{L_ij} = {x ∈ R^n : P(x|c_i) = P(x|c_j)} of L_ij(x) is Lebesgue measurable with measure μ, we get:\n\n∫_{R^n} {sgn L_ij(x) − 2α_ij δ[L_ij(x)]} P_ij(x) dx = ∫_{R^n} sgn L_ij(x) P_ij(x) dx − 2Ψ(P_i, P_j)α_ij,\n\nwhere\n\nΨ(P_i, P_j) = ∫_{R^n} P_ij(x) δ[L_ij(x)] dx = ∫_{Z_{L_ij}} P_ij(x)/|∇L_ij(x)| · dμ.\n\nTherefore,\n\nα_ij = [1/Ψ(P_i, P_j)] ∫_{R^n} ([sgn L_ij(x) − sgn L̂_ij(x)]/2) P_ij(x) dx.  (7)\n\nAs can be seen, α_ij is composed of two factors, the integral and the 1/Ψ(P_i, P_j) factor. The integral is bounded between −1 and 1 and determines the direction of the correction. When L_ij(x) and L̂_ij(x) always agree, the integral is zero and there is no correction. The correction favors i when α_ij is positive. This occurs when L_ij is positive and L̂_ij is negative more often than L_ij is negative and L̂_ij is positive, a situation improved upon by giving i “help.” Similarly, when α_ij is negative, the correction biases towards j.\n\nThe maximum amount of absolute likelihood correction possible is determined by the (always positive) 1/Ψ(P_i, P_j) factor. This is affected by two quantities: the mass around, and the log-likelihood-ratio gradient at, the decision boundary. Low mass at the decision boundary increases the maximum possible correction because any errors in the integral factor are being de-weighted. High gradient at the decision boundary also increases the maximum possible correction because any decision boundary deviation causes a higher change in likelihood ratio than if the gradient was low. 
Since we are correcting the likelihood ratio directly, this needs to be reflected in α_ij. When P(x|c_i) and P(x|c_j) are multivariate Gaussians with means μ_i and μ_j, identical covariance matrices Σ, and equal priors, this becomes:\n\nΨ(P_i, P_j) = exp(−(1/8)(μ_i − μ_j)ᵀΣ⁻¹(μ_i − μ_j)) / √(2π(μ_i − μ_j)ᵀΣ⁻¹(μ_i − μ_j)).\n\nAs the means diverge from each other, both the mass at the decision boundary decreases and the likelihood-ratio gradient increases, thereby increasing the maximum amount of correction.\n\nUnfortunately, it is quite difficult to explicitly evaluate Ψ(P_i, P_j) without knowing the true probability distributions. In this initial work, therefore, our investigations simplify by only computing the direction and not the magnitude of the correction. As will be seen, this assumption yields a likelihood-ratio adjustment that is similar in form to our previous KL-divergence based adjustment. More practically, the assumption significantly simplifies the derivation and still yields reasonable empirical results. Under this assumption, the expression for α_ij becomes:\n\nα_ij = (1/2) E_{P_ij(x)}[sgn L_ij(x)] − (1/2) E_{P_ij(x)}[sgn L̂_ij(x)].  (8)\n\nThe left term on the right of the equality is quite similar to the left difference on the right of the equality in the KL-divergence case (Equation 2). Again, because we have no information about the true class conditional models, we assume the left term in Equation 8 to be zero (denote this as assumption B). 
Comparing this with the corresponding assumption for the KL-divergence case (assumption A, Equations 2 and 3), it can be shown that 1) they are not identical in general, and 2) in the Gaussian case, A implies B but not vice versa, meaning B is weaker than A.\n\nUnder assumption B, an expression for the resulting α_ij can be derived using the weak law of large numbers, yielding:\n\nα_ij ≈ [1/(2(N_i + N_j))] ( Σ_{x∈C_i} sgn ln[P̂(x|c_j)/P̂(x|c_i)] − Σ_{x∈C_j} sgn ln[P̂(x|c_i)/P̂(x|c_j)] ),  (9)\n\nwhere x ∈ C_i and x ∈ C_j correspond to the samples as they are classified in a previous recognition pass; N_i and N_j are the number of samples from models c_i and c_j respectively. One can immediately see the similarity between this equation and the one using KLD [2].\n\nAs in [2], since the true classes are unknown, we perform a previous classification pass (e.g., using the original likelihood ratios) to get estimates and use these in Equation 9.\n\nNote that there are three potential sources of error in the analysis above. The first is the Ψ(P_i, P_j) factor that we neglected. The second is assumption B, which (since weaker) can be less severe than in the corresponding KL-divergence case. The third is the error due to the discontinuity of the sign function. To address the third problem, rather than using the sign function in Equation 9, we can approximate it with a continuous differentiable function with the goal of balancing the trade-off mentioned above. 
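A minimal sketch of this estimator (hypothetical helper names; the inputs are first-pass values of L̂_ij(x) for the samples hypothesized as class i and as class j, and the identity sgn ln[P̂(x|c_j)/P̂(x|c_i)] = −sgn L̂_ij(x) is used to simplify Equation 9):

```python
def sgn(z):
    # {-1, 0, 1}-valued sign function.
    return (z > 0) - (z < 0)

def alpha_sign(Lhat_ij_Ci, Lhat_ij_Cj):
    """Sketch of Equation 9: alpha_ij is minus half the sample average of
    sgn Lhat_ij(x) over all samples hypothesized as class i or class j."""
    Ni, Nj = len(Lhat_ij_Ci), len(Lhat_ij_Cj)
    s = sum(-sgn(L) for L in Lhat_ij_Ci) + sum(-sgn(L) for L in Lhat_ij_Cj)
    return s / (2.0 * (Ni + Nj))

# Illustrative values: most mass has Lhat_ij < 0, so the correction favors i.
print(round(alpha_sign([2.0, 1.5, -0.3], [-1.0, -2.0, -0.4]), 4))  # -> 0.1667
```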
There are a number of possible sign-function approximations, including the hyperbolic tangent, the arc tangent, and the shifted sigmoid, the latter of which is the most flexible because of its free parameter β.³\n\nSpecifically, the sigmoid function has the form f(z) = 1/(1 + e^{−βz}), where the free parameter β (an inverse temperature) determines how well the curve will approximate the discontinuous function. Using the sigmoid function, we can approximate the sign function as sgn z ≈ 2/(1 + e^{−βz}) − 1. Note that the approximation improves as β increases. Hence,\n\nα_ij ≈ [1/(2(N_i + N_j))] ( Σ_{x∈C_i} (1 − 2/(1 + e^{β L̂_ji(x)})) − Σ_{x∈C_j} (1 − 2/(1 + e^{β L̂_ij(x)})) ).  (10)\n\n4 Speech Recognition Evaluation\n\nAs in previous work [2], we implemented this technique on NYNEX PHONEBOOK [10, 1], a medium-vocabulary isolated-word speech corpus. Gaussian mixture hidden Markov models (HMMs) produced probability scores P̂(x|c_i), where here x is a matrix of feature values (one dimension as MFCC features and the other as time frames), and c_i is a word identity. The HMMs use four hidden states per phone, and 12 Gaussian mixtures per state (standard for this task [10]). This yields approximately 200k free model parameters in total. In our experiments, the steps are: 1) calculate P̂(x|c_i) using full inference (no Viterbi approximation) for each test case and for each word; 2) classify the test examples using just the log likelihood ratios L̂_ij = ln P̂(x|c_i)/P̂(x|c_j); 3) using the hypothesized (and error-full) class labels, calculate the test-set bias term using one of the techniques described above; and 4) classify again using the augmented likelihood ratio S_ij = L̂_ij + α_ij. 
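Under the same illustrative setup as the sign-based case, Equation 10 just replaces the hard sign with the smoothed 1 − 2/(1 + e^{βz}) term (a sketch; the function name and sample values are hypothetical):

```python
import math

def alpha_sigmoid(Lhat_ij_Ci, Lhat_ij_Cj, beta):
    """Sketch of Equation 10: sigmoid-smoothed bias term. Samples hypothesized
    as class i use Lhat_ji = -Lhat_ij; samples hypothesized as class j use
    Lhat_ij, matching the two sums in Equation 10."""
    def soft(z):
        return 1.0 - 2.0 / (1.0 + math.exp(beta * z))
    Ni, Nj = len(Lhat_ij_Ci), len(Lhat_ij_Cj)
    s = sum(soft(-L) for L in Lhat_ij_Ci) - sum(soft(L) for L in Lhat_ij_Cj)
    return s / (2.0 * (Ni + Nj))

# As beta grows the result approaches the hard sign-based estimate:
for beta in (1, 10, 100):
    print(beta, round(alpha_sigmoid([2.0, 1.5, -0.3], [-1.0, -2.0, -0.4], beta), 4))
```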
Since the procedure is no longer transitive, we run 1000 random tournament-style games (as in [2]) and choose the most frequent winner as the ultimate winner.\n\nTable 1: Word error rates (%) on speech data with various sign approximations.\n\nSIZE | ORIG | SIGN | TANH | ATAN | SIG(.1) | SIG(1) | SIG(10) | SIG(100) | SIG(200) | SIG(400) | KLD[2]\n75 | 2.34 | 1.76 | 1.76 | 1.76 | 1.82 | 1.76 | 1.56 | 1.57 | 1.33 | 1.34 | 1.91\n150 | 3.31 | 2.83 | 2.84 | 2.83 | 2.65 | 2.83 | 2.65 | 2.47 | 2.68 | 2.43 | 2.72\n300 | 5.23 | 4.75 | 4.75 | 4.70 | 4.74 | 4.75 | 4.29 | 3.95 | 4.34 | 4.34 | 4.29\n600 | 7.39 | 6.64 | 6.61 | 6.60 | 6.66 | 6.64 | 6.04 | 5.70 | 6.74 | 6.74 | 5.91\n\nThe results are shown in Table 1, where the first column gives the test-set vocabulary size (number of different classes). The second column shows the baseline word error rates (WERs) using only L̂_ij. The remaining columns are the bias-corrected results with various sign approximations, namely sign (Equation 9), hyperbolic and arc tangent, and the shifted sigmoid with various β values (thus allowing us to investigate the trade-off mentioned in Section 3.1). From the results we can see that a larger-β sigmoid is usually better, with overall performance increasing with β. This is because with large β, the shifted sigmoid curve better approximates the sign function. 
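The repeated random-order tournaments can be sketched as follows (illustrative; the pairwise decision function here is a made-up intransitive preference, not scores from the recognizer):

```python
import random
from collections import Counter

def random_tournament_vote(beats, classes, n_games=1000, seed=0):
    """Play n_games tournaments with random comparison orders, where
    beats(i, j) -> True means class i wins the (possibly intransitive)
    pairwise game against class j; return the most frequent final winner."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_games):
        order = list(classes)
        rng.shuffle(order)
        winner = order[0]
        for challenger in order[1:]:
            if not beats(winner, challenger):
                winner = challenger
        wins[winner] += 1
    return wins.most_common(1)[0][0]

# Intransitive cycle 0 > 1 > 2 > 0, plus a class 3 that loses to everyone:
pref = {(0, 1), (1, 2), (2, 0), (0, 3), (1, 3), (2, 3)}
winner = random_tournament_vote(lambda i, j: (i, j) in pref, [0, 1, 2, 3])
print(winner in (0, 1, 2))  # -> True: class 3 can never be the final winner
```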
For β = 100, the results are even better than our previous KL-divergence (KLD) results reported in [2] (right-most column in the table). It can also be seen that when β is greater than 100, the WERs are not consistently better. This indicates that the inaccuracies due to the Taylor error term start adversely affecting the results at around β = 100.\n\n³Note that the other soft sign functions can also be defined to utilize a β smoothness parameter.\n\n5 UCI Dataset Evaluation\n\nTable 2: Error rates in % (and std where applicable) on the UCI data.\n\ndata | NN baseline | NN KLD | NN sign | NN sig(10) | NB baseline | NB KLD | NB sign | NB sig(10)\naustralian | 16.75(3.51) | 16.33(3.66) | 16.17(3.63) | 16.32(3.75) | 14.89(1.97) | 14.29(2.45) | 14.76(2.45) | 14.76(2.37)\nbreast | 2.94(1.16) | 2.62(1.15) | 2.63(1.15) | 2.65(1.15) | 2.45(1.93) | 2.29(2.02) | 2.13(2.07) | 1.86(2.07)\nchess | 0.56 | 0.46 | 0.47 | 0.37 | 12.66 | 12.76 | 13.04 | 12.85\ncleve | 25.67(3.40) | 24.35(2.82) | 24.01(2.27) | 24.01(3.94) | 17.91(2.37) | 15.55(1.81) | 15.22(1.82) | 16.22(2.61)\ncorral | 2.44(1.26) | 1.82(1.16) | 1.19(1.16) | 1.19(1.16) | 12.77(3.66) | 9.57(2.12) | 9.57(2.62) | 12.05(4.80)\ncrx | 17.41(3.18) | 17.25(2.67) | 17.11(2.91) | 17.26(3.00) | 15.05(3.67) | 14.02(3.91) | 13.06(3.67) | 15.05(3.67)\ndiabetes | 28.04(3.08) | 26.88(3.56) | 27.41(4.13) | 27.18(1.98) | 25.71(2.13) | 24.79(2.68) | 24.24(3.49) | 24.66(2.59)\nflare | 20.98(2.26) | 19.37(2.16) | 18.29(2.25) | 18.46(1.85) | 20.24(2.31) | 19.55(2.63) | 18.70(1.87) | 16.64(2.34)\ngerman | 29.96(3.49) | 28.54(3.45) | 28.82(2.53) | 28.25(3.71) | 24.58(2.57) | 26.55(1.88) | 24.79(2.30) | 24.25(2.50)\nglass | 42.16(2.06) | 39.63(1.76) | 41.92(1.92) | 40.95(2.00) | 44.12(7.96) | 42.24(8.64) | 42.06(9.22) | 42.28(7.93)\nglass2 | 28.82(2.57) | 26.23(2.61) | 26.95(2.65) | 26.23(2.57) | 22.36(9.01) | 21.15(9.25) | 21.77(9.25) | 22.36(9.01)\nheart | 21.83(3.77) | 21.48(4.26) | 21.19(4.52) | 21.09(4.23) | 15.50(6.01) | 15.11(5.34) | 15.11(5.72) | 15.11(6.01)\nhepatitis | 19.46(7.10) | 16.10(6.13) | 17.16(6.92) | 15.82(6.94) | 16.18(5.92) | 18.29(5.96) | 18.04(5.92) | 15.45(4.56)\niris | 8.13(1.60) | 6.84(1.44) | 6.26(1.47) | 6.84(1.44) | 6.99(1.78) | 6.99(1.78) | 6.99(1.78) | 6.99(1.78)\nletter | 38.66 | 34.66 | 37.10 | 37.00 | 30.68 | 30.88 | 30.48 | 30.64\nlymphography | 24.46(4.86) | 23.81(4.57) | 23.29(4.52) | 23.29(4.86) | 16.62(8.64) | 18.27(9.25) | 17.34(8.91) | 15.31(8.91)\nmofn-3-7-10 | 0 | 0 | 0 | 0 | 8.59 | 4.57 | 1.56 | 3.42\npima | 25.96(2.01) | 25.22(2.95) | 24.82(2.87) | 25.96(2.19) | 25.71(2.13) | 24.79(2.68) | 24.24(3.49) | 24.66(2.59)\nsatimage | 15.80 | 14.25 | 14.40 | 14.25 | 19.15 | 19.35 | 19.25 | 18.70\nsegment | 7.53 | 7.40 | 7.27 | 7.53 | 12.21 | 11.73 | 11.82 | 12.21\nshuttle-small | 0.87 | 0.77 | 0.87 | 0.77 | 1.40 | 1.41 | 1.50 | 1.50\nsoybean-large | 8.47(1.31) | 8.29(1.39) | 7.18(1.08) | 8.47(1.31) | 8.71(2.70) | 9.13(2.60) | 8.35(2.65) | 8.37(2.70)\nvehicle | 28.39(4.68) | 28.15(4.62) | 27.70(4.44) | 28.39(4.75) | 38.92(4.47) | 38.59(5.05) | 38.79(4.46) | 37.84(4.43)\nvote | 7.40(2.22) | 6.94(1.77) | 6.94(1.77) | 7.17(2.05) | 9.91(1.72) | 9.68(2.49) | 9.68(1.72) | 9.68(1.72)\nwaveform-21 | 26.21 | 26.17 | 26.12 | 26.14 | 21.45 | 21.11 | 20.15 | 21.40\n\nIn order to show that our methodology is general beyond isolated-word speech recognition, we also evaluated this technique on the entire UCI machine learning repository [9]. 
In our experiments, baseline classifiers are built using one of: 1) the Matlab neural network (NN) toolbox, with feed-forward 3-layer perceptrons having different numbers of hidden units and training epochs (optimized over a large set to achieve the best possible baseline for each test case), trained using the Levenberg-Marquardt algorithm [11]; or 2) the MLC++ toolbox, to produce naïve Bayes (NB) classifiers that have been smoothed using Dirichlet priors. In each case (i.e., NN or NB), we augmented the resulting likelihood ratios with bias correction terms, thereby evaluating our technique using quite different forms of baseline classifiers. Unlike the above, with these data sets we have so far tried only one random tournament game to decide the winner.\n\nFor the NN results, hidden units use a logistic sigmoid, and output units use a soft-max function, making the network outputs interpretable as posterior probabilities P(c|x), where x is the sample and c is the class. While our bias correction described above is in terms of likelihood ratios L_ij(x), posteriors can be used as well if the posteriors are divided by the priors, giving the relation P(c|x)/P(c) = P(x|c)/p(x) (i.e., scaled likelihoods), which produces the standard L_ij(x) values when used in a likelihood ratio.\n\nAs was done in [5], for the small data sets the experimental results use 5-fold cross-validation with randomly selected chunks; results show mean and standard deviation (std) in parentheses. For the larger data sets, we use the same held-out training/test sets as in [5] (so std is not shown). The experimental procedure is similar to that described in Section 4, except that scaled likelihoods are used for the NN baselines. Again, first-pass error-full test-set hypothesized answers are used to compute the bias corrections.\n\nTable 2 shows our results for both the NN (columns 2-5) and NB (columns 6-9) baseline classifiers. 
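The posterior-to-scaled-likelihood conversion can be sketched as follows (a minimal illustration with made-up posterior and prior values; since p(x) cancels in the ratio, scaled likelihoods suffice):

```python
import math

def scaled_log_likelihood_ratio(posteriors, priors, i, j):
    """L_ij(x) from network posteriors: ln of (P(c_i|x)/P(c_i)) over
    (P(c_j|x)/P(c_j)); the unknown p(x) in P(c|x)/P(c) = P(x|c)/p(x)
    cancels between numerator and denominator."""
    return math.log((posteriors[i] / priors[i]) / (posteriors[j] / priors[j]))

post = [0.7, 0.2, 0.1]         # illustrative soft-max outputs P(c|x)
prior = [1 / 3, 1 / 3, 1 / 3]  # equal class priors
print(scaled_log_likelihood_ratio(post, prior, 0, 1) > 0)  # -> True
```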
Within each baseline group, the first column shows the baseline accuracy (with the 5-fold standard deviations when the data set is small). The second column shows results using KL-divergence based bias corrections; these are the first published KLD results on the UCI data. The third column shows results with sign-based correction (Equation 9), and the fourth column shows the sigmoid (β = 10) case (Equation 10).\n\nWhile not the point of this paper, one immediately sees that the NB baseline results are often better than the NN baseline results (15 out of 25 times). Using the NN as a baseline, the table shows that the KLD results are almost always better than the baseline (24 times out of 25). Also, the sign correction is better than the baseline 23 out of 25 times, and the sigmoid(10) results are better 20 times. Also (not shown in the table), we found that β = 10 is slightly better than β = 1, but there is no advantage to using β = 100. These results therefore show that the NN KLD correction typically beats the sign and sigmoid corrections, possibly owing to the error in the Taylor approximation. Using the NB classifier as the baseline, however, shows not only improved baseline results in general but also that the sigmoid(10) improves more often. Specifically, the KLD results are better than the baseline 16 times, sign is better than the baseline 18 times, and sigmoid(10) beats the baseline 19 times, suggesting that sigmoid(10) typically wins over the KLD case.\n\n6 Discussion\n\nWe have introduced a new necessary intransitive likelihood-ratio classifier. This was done by using sign-based corrections to likelihood ratios and by using continuous differentiable approximations of the sign function in order to be able to vary the inherent trade-off between sign-function approximation accuracy and Taylor error. 
We have applied these techniques to both a speech recognition corpus and the UCI data sets, as well as applying previous KL-divergence based corrections to the latter data. Results on the UCI data sets confirm that our techniques reasonably generalize to data sets other than speech recognition. This suggests that the framework could be applied to other machine learning tasks.\n\nThis work was supported in part by NSF grants IIS-0093430 and IIS-0121396.\n\nReferences\n\n[1] Jeff Bilmes. Buried Markov models for speech recognition. In IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, March 1999.\n[2] Jeff Bilmes, Gang Ji, and M. Meilă. Intransitive likelihood-ratio classifiers. In Neural Information Processing Systems: Natural and Synthetic, December 2001.\n[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, Inc., 1991.\n[4] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley and Sons, second edition, 2001.\n[5] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2-3):131-163, 1997.\n[6] D. S. Jones. Generalised Functions. McGraw-Hill Publishing Company Limited, 1966.\n[7] J. Kevorkian. Partial Differential Equations: Analytical Solution Techniques. New York: Springer, 2000.\n[8] R. Duncan Luce and Howard Raiffa. Games and Decisions: Introduction and Critical Survey. Dover, 1957.\n[9] P. M. Murphy and D. W. Aha. UCI Repository of Machine Learning Databases, 1995.\n[10] J. Pitrelli, C. Fong, S. H. Wong, J. R. Spitz, and H. C. Leung. PhoneBook: a phonetically-rich isolated-word telephone-speech database. In IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1995.\n[11] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. 
Cambridge University Press, Cambridge, England, second edition, 1992.\n[12] M. M. Rao. Measure Theory and Integration. John Wiley and Sons, 1987.\n[13] P. D. Straffin. Game Theory and Strategy. The Mathematical Association of America, 1993.\n", "award": [], "sourceid": 2521, "authors": [{"given_name": "Gang", "family_name": "Ji", "institution": null}, {"given_name": "Jeff", "family_name": "Bilmes", "institution": null}]}