{"title": "A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1392, "abstract": "", "full_text": "A Kullback-Leibler Divergence Based Kernel for\nSVM Classi\ufb01cation in Multimedia Applications\n\nPedro J. Moreno Purdy P. Ho\n\nHewlett-Packard\n\nCambridge Research Laboratory\n\nCambridge, MA 02142, USA\n\n{pedro.moreno,purdy.ho}@hp.com\n\nNuno Vasconcelos\n\nUCSD ECE Department\n\n9500 Gilman Drive, MC 0407\n\nLa Jolla, CA 92093-0407\nnuno@ece.ucsd.edu\n\nAbstract\n\nOver the last years signi\ufb01cant efforts have been made to develop kernels\nthat can be applied to sequence data such as DNA, text, speech, video\nand images. The Fisher Kernel and similar variants have been suggested\nas good ways to combine an underlying generative model in the feature\nspace and discriminant classi\ufb01ers such as SVM\u2019s. In this paper we sug-\ngest an alternative procedure to the Fisher kernel for systematically \ufb01nd-\ning kernel functions that naturally handle variable length sequence data\nin multimedia domains. In particular for domains such as speech and\nimages we explore the use of kernel functions that take full advantage\nof well known probabilistic models such as Gaussian Mixtures and sin-\ngle full covariance Gaussian models. We derive a kernel distance based\non the Kullback-Leibler (KL) divergence between generative models. In\neffect our approach combines the best of both generative and discrim-\ninative methods and replaces the standard SVM kernels. 
We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM's and the generative classifiers in speaker identification and image classification.\n\n1 Introduction\n\nOver the last few years Support Vector Machines (SVM's) [1] have become extremely successful discriminative approaches to pattern classification and regression problems. Excellent results have been reported in applying SVM's in multiple domains. However, the application of SVM's to data sets where each element has variable length remains problematic. Furthermore, for those data sets where the elements are represented by large sequences of vectors, such as speech, video or image recordings, the direct application of SVM's to the original vector space is typically unsuccessful.\n\nWhile most research in the SVM community has focused on the underlying learning algorithms, the study of kernels has also gained importance recently. Standard kernels such as linear, Gaussian, or polynomial do not take full advantage of the nuances of specific data sets. This has motivated plenty of research into the use of alternative kernels in the area of multimedia. For example, [2] applies normalization factors to polynomial kernels for speaker identification tasks. Similarly, [3] explores the use of heavy-tailed Gaussian kernels in image classification tasks. These approaches in general only tune standard kernels (linear, polynomial, Gaussian) to the nuances of multimedia data sets.\n\nOn the other hand, statistical models such as Gaussian Mixture Models (GMM's) or Hidden Markov Models make strong assumptions about the data. They are simple to learn and estimate, and are well understood by the multimedia community. 
It is therefore attractive to explore methods that combine these models with discriminative classifiers. The Fisher kernel proposed by Jaakkola [4] effectively combines both generative and discriminative classifiers for variable length sequences. Besides its original application in genomic problems, it has also been applied to multimedia domains: [5] applies it to audio classification with good results, and [6] tries a variation of the Fisher kernel on phonetic classification tasks.\n\nWe propose a different approach to combining discriminative and generative methods for classification. Instead of using the standard kernels, we leverage successful generative models used in the multimedia field. We use diagonal covariance GMM's and full covariance Gaussian models to better represent each individual audio and image object. We then use a metric derived from the symmetric Kullback-Leibler (KL) divergence to effectively compute inner products between multimedia objects.\n\n2 Kernels for SVM's\n\nMuch of the flexibility and classification power of SVM's resides in the choice of kernel. Some examples are the linear, polynomial of degree p, and Gaussian kernels. These kernel functions have two main disadvantages for multimedia signals. First, they only model inner products between individual feature vectors, as opposed to the ensembles of vectors that are typical of multimedia signals. Secondly, these kernels are quite generic and do not take advantage of the statistics of the individual signals we are targeting.\n\nThe Fisher kernel approach [4] is a first attempt at solving these two issues. It assumes the existence of a generative model that explains all possible data well. For example, in the case of speech signals the generative model p(x|θ) is often a Gaussian mixture, where the model parameters θ are the priors, means, and diagonal covariance matrices. 
GMM's are also quite popular in the image classification and retrieval domains; [7] shows good results on image classification and retrieval using Gaussian mixtures.\n\nFor any given sequence of vectors defining a multimedia object X = {x1, x2, . . . , xm}, and assuming that each vector in the sequence is independent and identically distributed, we can easily define the likelihood of the ensemble being generated by p(x|θ) as P(X|θ) = ∏_{i=1}^{m} p(xi|θ). The Fisher score maps each individual sequence {X1, . . . , Xn}, each composed of a different number of feature vectors, into a single vector in the gradient log-likelihood space.\n\nThis new feature vector, the Fisher score, is defined as\n\nUX = ∇θ log(P(X|θ))    (1)\n\nEach component of UX is a derivative of the log-likelihood of the vector sequence X with respect to a particular parameter of the generative model. In our case the parameters θ of the generative model are chosen from either the prior probabilities, the mean vector or the diagonal covariance matrix of each individual Gaussian in the mixture model. For example, if we use the mean vectors as our model parameters, i.e., θ = µk for one of the K possible mixture components, then the Fisher score is\n\n∇µk log(P(X|θ)) = ∑_{i=1}^{m} P(k|xi) Σk^{-1} (xi − µk)    (2)\n\nwhere P(k|xi) represents the a posteriori probability of mixture k given the observed feature vector xi. Effectively we transform each multimedia object (audio or image) X of variable length into a single vector UX of fixed dimension.\n\n3 Kullback-Leibler Divergence Based Kernels\n\nWe start with a statistical model p(x|θi) of the data, i.e., we estimate the parameters θi of a generic probability density function (PDF) for each multimedia object (utterance or image) Xi = {x1, x2, . . . , xm}. 
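As an illustration of this baseline, the Fisher score of Eq. (2) can be sketched in a few lines of numpy. This is a minimal sketch under our own conventions (function name and array layout are not from the paper), for a diagonal-covariance GMM and gradients taken with respect to the means only:

```python
import numpy as np

def fisher_score_means(X, priors, means, variances):
    """Fisher score of Eq. (2): gradient of log P(X|theta) w.r.t. the GMM means.

    X: (m, d) sequence of feature vectors for one object; the GMM has K
    diagonal-covariance components: priors (K,), means (K, d), variances (K, d)."""
    # log N(x_i | mu_k, diag(var_k)) for every pair (i, k)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)        # (K,)
    diff = X[:, None, :] - means[None, :, :]                             # (m, K, d)
    log_lik = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None]).sum(axis=2)
    # posterior responsibilities P(k|x_i), computed in the log domain
    log_post = np.log(priors)[None, :] + log_lik
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                              # (m, K)
    # sum_i P(k|x_i) Sigma_k^{-1} (x_i - mu_k), stacked over components k
    grad = np.einsum('ik,ikd->kd', post, diff / variances[None])
    return grad.ravel()  # fixed-length vector U_X, regardless of m
```

Whatever the length m of the input sequence, the returned vector has fixed dimension K·d, which is what makes the subsequent SVM training possible.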
We pick PDF's that have been shown over the years to be quite effective at modeling multimedia patterns. In particular we use diagonal Gaussian mixture models and single full covariance Gaussian models. In the first case the parameters θi are the priors, mean vectors, and diagonal covariance matrices, while in the second case they are the mean vector and full covariance matrix.\n\nOnce the PDF p(x|θi) has been estimated for each training and testing multimedia object we replace the kernel computation in the original sequence space by a kernel computation in the PDF space:\n\nK(Xi, Xj) =⇒ K(p(x|θi), p(x|θj))    (3)\n\nTo compute the PDF parameters θi for a given object Xi we use a maximum likelihood approach. In the case of diagonal mixture models there is no analytical solution for θi and we use the Expectation Maximization algorithm. In the case of a single full covariance Gaussian model there is a simple analytical solution for the mean vector and covariance matrix. Effectively we are proposing to map the input space Xi to a new feature space θi.\n\nNotice that if the number of vectors in the multimedia sequence Xi is small and there is not enough data to accurately estimate θi, we can use regularization methods, or even replace the maximum likelihood solution for θi by a maximum a posteriori solution. Other solutions, such as starting from a generic PDF and adapting its parameters θi to the current object, are also possible.\n\nThe next step is to define the kernel distance in this new feature space. Because of the statistical nature of the feature space a natural choice for a distance metric is one that compares PDF's. 
The standard statistical literature offers several possible choices; in this paper, however, we only report results on the symmetric Kullback-Leibler (KL) divergence\n\nD(p(x|θi), p(x|θj)) = ∫ p(x|θi) log(p(x|θi)/p(x|θj)) dx + ∫ p(x|θj) log(p(x|θj)/p(x|θi)) dx    (4)\n\nwhere both integrals are taken over the full feature space. Because a matrix of kernel distances directly based on the symmetric KL divergence does not satisfy the Mercer conditions, i.e., it is not a positive definite matrix, we need a further step to generate a valid kernel. Among many possibilities we simply exponentiate the symmetric KL divergence, then scale and shift it (the A and B factors below) for numerical stability reasons:\n\nK(Xi, Xj) =⇒ K(p(x|θi), p(x|θj)) =⇒ e^{−A D(p(x|θi), p(x|θj)) + B}    (5)\n\nIn the case of Gaussian mixture models the computation of the KL divergence is not direct. In fact there is no analytical solution to Eq. (4), and we have to resort to Monte Carlo methods or numerical approximations. In the case of single full covariance models the KL divergence has an analytical solution\n\nD(p(x|θi), p(x|θj)) = tr(Σi Σj^{-1}) + tr(Σj Σi^{-1}) − 2S + tr((Σi^{-1} + Σj^{-1}) (µi − µj)(µi − µj)^T)    (6)\n\nwhere S is the dimensionality of the original feature data x. This distance is similar to the arithmetic harmonic sphericity (AHS) distance popular in the speaker identification and verification research community [8].\n\nNotice that there are significant differences between our KL divergence based kernel and the Fisher kernel method. In our approach there is no underlying generative model to represent all the data. 
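As a concrete sketch, the closed-form divergence of Eq. (6) and the exponentiation of Eq. (5) are straightforward to implement in numpy. The function names are our own, and the default A and B values are placeholders that would be tuned in practice:

```python
import numpy as np

def symmetric_kl_gaussian(mu_i, cov_i, mu_j, cov_j):
    """Symmetric KL divergence of Eq. (6) between two full-covariance Gaussians."""
    S = mu_i.shape[0]                                  # feature dimensionality
    inv_i, inv_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    diff = mu_i - mu_j
    return (np.trace(cov_i @ inv_j) + np.trace(cov_j @ inv_i) - 2.0 * S
            + diff @ (inv_i + inv_j) @ diff)           # = tr((inv_i+inv_j) diff diff^T)

def kl_kernel(mu_i, cov_i, mu_j, cov_j, A=1.0, B=0.0):
    """Exponentiated, scaled and shifted divergence, as in Eq. (5)."""
    return np.exp(-A * symmetric_kl_gaussian(mu_i, cov_i, mu_j, cov_j) + B)
```

Note that the divergence vanishes when the two models coincide, so with B = 0 the kernel of an object with itself is exactly 1, as one would expect of a similarity measure.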
We do not use a single PDF (even if it encodes a latent variable indicative of class membership) as a way to map the multimedia object from the original feature vector space to a gradient log-likelihood vector space. Instead, each individual object (consisting of a sequence of feature vectors) is modeled by its own PDF. This represents a more localized version of the Fisher kernel's underlying generative model. Effectively, the modeling power is spent where it matters most: on each of the individual objects in the training and testing sets. Interestingly, the object PDF does not have to be extremely complex. As we will show in the experimental section, a single full covariance Gaussian model produces extremely good results. Also, unlike the gradient log-likelihood space used by the Fisher kernel, our approach has no true intermediate space: multimedia objects are transformed directly into PDF's.\n\n4 Audio and Image Databases\n\nWe chose the 50 most frequent speakers from the HUB4-96 [9] News Broadcasting corpus and 50 speakers from the Narrowband version of the KING corpus [10] to train and test our new kernels on speaker identification and verification tasks. The HUB training set contains about 25 utterances (each 3-7 seconds long) from each speaker, resulting in 1198 utterances (or about 2 hours of speech). The HUB test set contains the rest of the utterances from these 50 speakers, resulting in 15325 utterances (or about 21 hours of speech). The KING corpus is commonly used for speaker identification and verification in the speech community [11]. Its training set contains 4 utterances (each about 30 seconds long) from each speaker and the test set contains the remaining 6 from these 50 speakers. A total of 200 training utterances (about 1.67 hours of speech) and 300 test utterances (about 2.5 hours of speech) were used. 
Following standard practice in speech processing, each utterance was transformed into a sequence of 13-dimensional Mel-Frequency Cepstral vectors. The vectors were augmented with their first- and second-order time derivatives, resulting in a 39-dimensional feature vector. We also mean-normalized the KING utterances in order to compensate for the distortion introduced by different telephone channels. We did not do so for the HUB experiments, since mean normalizing the audio would remove important speaker characteristics.\n\nWe chose the Corel database [12] to train and test all algorithms on image classification. COREL contains a variety of objects, such as landscapes, vehicles, plants, and animals. To make the task more challenging we picked 8 classes of highly confusable objects: Apes, ArabianHorses, Butterflies, Dogs, Owls, PolarBears, Reptiles, and RhinosHippos. There were 100 images per class – 66 for training and 34 for testing; thus, a total of 528 training images and 272 testing images were used. All images are 353x225 pixel 24-bit RGB-color JPEGs. To extract feature vectors we followed standard practice in image processing. For each of the 3 color channels the image was scanned by an 8x8 window shifted every 4 pixels, and the 192 pixels under each window (64 per channel) were converted into a 192-dimensional Discrete Cosine Transform (DCT) feature vector. After this only the 64 low-frequency elements were kept, since they captured most of the image characteristics.\n\n5 Experiments and Results\n\nOur experiments trained and tested five different types of classifiers: the baseline GMM, the baseline AHS1, an SVM using the Fisher kernel, and SVM's using our two new KL divergence based kernels.\n\nWhen training and testing our new GMM/KL divergence based kernels, the sequence of feature vectors {x1, x2, . . . , xm} from each utterance or image X was modeled by a single GMM of diagonal covariances. 
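As noted in Section 3, Eq. (4) has no analytical solution for mixtures. One simple option is a Monte Carlo estimate of the symmetric divergence; this is an assumed sketch (the paper does not specify its exact approximation), with illustrative names and sample counts, for diagonal-covariance GMM's represented as (priors, means, variances) tuples:

```python
import numpy as np

def gmm_logpdf(X, priors, means, variances):
    """Log-density of a diagonal-covariance GMM at each row of X."""
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)      # (K,)
    sq = ((X[:, None, :] - means[None]) ** 2) / variances[None]        # (n, K, d)
    comp = np.log(priors)[None, :] + log_norm[None, :] - 0.5 * sq.sum(axis=2)
    return np.logaddexp.reduce(comp, axis=1)                           # (n,)

def gmm_sample(n, priors, means, variances, rng):
    """Draw n samples: pick a component, then sample from its Gaussian."""
    ks = rng.choice(len(priors), size=n, p=priors)
    return means[ks] + rng.standard_normal((n, means.shape[1])) * np.sqrt(variances[ks])

def symmetric_kl_mc(p, q, n=10000, seed=0):
    """Monte Carlo estimate of the symmetric KL divergence of Eq. (4)."""
    rng = np.random.default_rng(seed)
    xp, xq = gmm_sample(n, *p, rng), gmm_sample(n, *q, rng)
    d_pq = np.mean(gmm_logpdf(xp, *p) - gmm_logpdf(xp, *q))  # E_p[log p - log q]
    d_qp = np.mean(gmm_logpdf(xq, *q) - gmm_logpdf(xq, *p))  # E_q[log q - log p]
    return d_pq + d_qp
```

The estimate concentrates around the true divergence as n grows; a few thousand samples per model pair is typically enough for kernel-matrix purposes, though the adequate n depends on the dimensionality and the mixtures involved.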
Then the KL divergences between each of these GMM's were computed according to Eq. (4) and transformed according to Eq. (5). This resulted in kernel matrices for training and testing that could be fed directly into an SVM classifier. Since all our SVM experiments are multiclass, we used the 1-vs-all training approach. The class with the largest positive score was designated as the winner class. For the experiments in which the object PDF was a single full covariance Gaussian we followed a similar procedure: the KL divergences between each pair of PDF's were computed according to Eq. (6) and transformed according to Eq. (5). The dimensions of the resulting training and testing kernel matrices are shown in Table 1.\n\nTable 1: Dimensions of the training and testing kernel matrices of both new probabilistic kernels on the HUB, KING, and COREL databases.\n\nHUB Training | HUB Testing | KING Training | KING Testing | COREL Training | COREL Testing\n1198x1198 | 15325x1198 | 200x200 | 300x200 | 528x528 | 272x528\n\nIn the Fisher kernel experiments we computed the Fisher score vector UX for each training and testing utterance and image, with the parameters θ chosen as the prior probabilities of each mixture Gaussian. The underlying generative model was the same one used for the GMM classification experiments.\n\nThe task of speaker verification is different from speaker identification: we make a binary decision of whether or not an unknown utterance is spoken by the person of the claimed identity. Because we trained the SVM's using the one-vs-all approach, their output can be used directly in speaker verification. To verify whether an utterance belongs to class A we just use the A-vs-all SVM output. On the other hand, the scores of the GMM and AHS classifiers cannot be used directly for verification experiments. 
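For the single full covariance Gaussian case, the whole kernel-matrix pipeline described above can be summarized in a short numpy-only sketch (function names and the A scale are illustrative; Eq. (6) is re-implemented inline to keep the sketch self-contained). The resulting Gram matrices have the shapes of Table 1 and can be handed to any SVM implementation that accepts precomputed kernels:

```python
import numpy as np

def fit_gaussian(X):
    """ML estimate of a single full-covariance Gaussian for one object (Section 3)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)  # bias=True gives the ML covariance
    return mu, cov

def symmetric_kl(mu_i, cov_i, mu_j, cov_j):
    """Closed-form symmetric KL divergence of Eq. (6)."""
    S = mu_i.shape[0]
    inv_i, inv_j = np.linalg.inv(cov_i), np.linalg.inv(cov_j)
    diff = mu_i - mu_j
    return (np.trace(cov_i @ inv_j) + np.trace(cov_j @ inv_i) - 2.0 * S
            + diff @ (inv_i + inv_j) @ diff)

def gram_matrix(objs_a, objs_b, A=0.01, B=0.0):
    """Kernel matrix K[a, b] = exp(-A * D(p_a, p_b) + B), as in Eq. (5).

    objs_a, objs_b: lists of (m_i, d) arrays, one variable-length object each."""
    pa = [fit_gaussian(X) for X in objs_a]
    pb = [fit_gaussian(X) for X in objs_b]
    return np.array([[np.exp(-A * symmetric_kl(*p, *q) + B) for q in pb] for p in pa])
```

The training matrix (both arguments set to the training objects) is square and symmetric; the testing matrix has one row per test object and one column per training object, matching the dimensions listed in Table 1. In scikit-learn, for instance, such matrices can be used via SVC(kernel='precomputed').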
We need to somehow combine the scores from the non-claimed identities, i.e., if we want to verify whether an utterance belongs to speaker A we need to compute a model for non-A speakers. This non-class model can be computed by first pooling the 49 non-class GMM's together to form a super GMM with 256x49 mixtures (each speaker GMM has 256 mixtures). The score produced by this super GMM is then subtracted from the score produced by the claimed speaker's GMM. In the case of the AHS classifiers we estimate the non-class score as the arithmetic mean of the other 49 speaker scores. To compute the miss and false positive rates we compare the decision scores to a threshold Θ. By varying Θ we can compute Detection Error Tradeoff (DET) curves such as the ones shown in Fig. 1.\n\n1 Arithmetic harmonic sphericity classifiers pool together all vectors belonging to a class and fit a single full covariance Gaussian model to the data. Similarly, a single full covariance model is fitted to each testing utterance. The similarity between the testing utterances and the class models is measured according to Eq. (6). The class with the minimum distance is chosen as the winning class.\n\nWe compare the performance of all 5 classifiers on the speaker verification and speaker identification tasks. Table 2 shows equal error rates (EER's) for speaker verification and speaker identification accuracies for both speech corpora.\n\nTable 2: Comparison of all the classifiers used on the HUB and KING corpora. 
Both classification accuracy (Acc) and equal error rates (EER) are reported in percentage points.\n\nType of Classifier | HUB Acc | HUB EER | KING Acc | KING EER\nGMM | 87.4 | 16.1 | 68.0 | 8.1\nAHS | 81.7 | 26.8 | 48.3 | 9.1\nSVM Fisher | 62.4 | 12.3 | 48.0 | 14.0\nSVM GMM/KL | 83.8 | 7.9 | 72.7 | 7.8\nSVM COV/KL | 84.7 | 6.6 | 79.7 | 7.4\n\nWe also compared the performance of 4 classifiers on the image classification task. Since the AHS classifier is not an effective image classifier we excluded it here. Table 3 shows the classification accuracies.\n\nTable 3: Comparison of the 4 classifiers used on the COREL animal subset. Classification accuracies are reported in percentage points.\n\nType of Classifier | Accuracy\nGMM | 82.0\nSVM Fisher | 73.5\nSVM GMM/KL | 85.3\nSVM COV/KL | 80.9\n\nOur results using the KL divergence based kernels on both multimedia data types are quite promising. In the case of the HUB experiments all classifiers perform similarly on both the speaker verification and identification tasks, with the exception of the SVM Fisher, which performs significantly worse. For the KING database, however, our KL based SVM kernels outperform all other classifiers on both identification and verification tasks. Interestingly, the Fisher kernel again performs quite poorly. Looking at the DET plots for both corpora we can see that on the HUB experiments the new SVM kernels perform quite well, and on the KING corpus they perform much better than any other verification system.\n\nIn the image classification experiments with the COREL database both KL based SVM kernels outperform the Fisher SVM; the GMM/KL kernel even outperforms the baseline GMM classifier.\n\n6 Conclusion and Future Work\n\nIn this paper we have proposed a new method of combining generative models and discriminative classifiers (SVM's). Our approach is extremely simple. 
For every multimedia object represented by a sequence of vectors, a PDF is learned using maximum likelihood approaches. We have experimented with PDF's that are commonly used in the multimedia community. However, the method is generic enough that it could be used with any PDF. In the case of GMM's we use the EM algorithm to learn the model parameters θ. In the case of a single full covariance Gaussian we directly estimate its parameters. We then introduce the idea of computing kernel distances via a direct comparison of PDF's. In effect we replace the standard kernel distance on the original data K(Xi, Xj) by a new kernel derived from the symmetric Kullback-Leibler (KL) divergence, K(Xi, Xj) −→ K(p(x|θi), p(x|θj)). After that a kernel matrix is computed and a traditional SVM can be used.\n\nFigure 1: Speaker verification detection error tradeoff (DET) curves for the HUB and the KING corpora, tested on all 50 speakers.\n\nIn our experiments we have validated this new approach on speaker identification, speaker verification, and image classification tasks by comparing its performance to Fisher kernel SVM's and other well-known classification algorithms: the GMM and AHS methods. Our results show that our new method of combining generative models and SVM's always outperforms the SVM Fisher kernel and the AHS method, and often outperforms the GMM classifier as well. The equal error rates are also consistently better with the new kernel SVM methods. In the case of image classification our GMM/KL divergence based kernel has the best performance among the four classifiers, while our single full covariance Gaussian distance based kernel outperforms most other classifiers and does only slightly worse than the baseline GMM. 
All these encouraging results show that SVM's can be improved by paying careful attention to the nature of the data being modeled. In both the audio and image tasks we simply take advantage of previous years of research in generative methods.\n\nThe good results obtained with the full covariance single Gaussian KL kernel also make our algorithm a very attractive alternative to more complex methods that tune system parameters and combine generative classifiers with discriminative methods, such as the Fisher SVM. This kernel's performance is consistently good across all databases, and it is especially simple and fast to compute and requires no tuning of system parameters.\n\nWe feel that this approach of combining generative models and discriminative classifiers via KL divergences of estimated PDF's is quite generic and can possibly be applied to other domains. We plan to explore its use in other multimedia related tasks.\n\nReferences\n\n[1] Vapnik, V., Statistical Learning Theory, John Wiley and Sons, New York, 1998.\n\n[2] Wan, V. and Campbell, W., “Support vector machines for speaker verification and identification,” IEEE Proceedings, 2000.\n\n[3] Chapelle, O., Haffner, P. and Vapnik, V., “Support vector machines for histogram-based image classification,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1055–1064, September 1999.\n\n[4] Jaakkola, T., Diekhans, M. and Haussler, D., “Using the Fisher kernel method to detect remote protein homologies,” in Proceedings of the International Conference on Intelligent Systems for Molecular Biology, Aug. 1999.\n\n[5] Moreno, P. J. 
and Rifkin, R., “Using the Fisher kernel method for web audio classification,” ICASSP, 2000.\n\n[6] Smith, N., Gales, M. and Niranjan, M., “Data dependent kernels in SVM classification of speech patterns,” Tech. Rep. CUED/F-INFENG/TR.387, Cambridge University Engineering Department, 2001.\n\n[7] Vasconcelos, N. and Lippman, A., “A unifying view of image similarity,” IEEE International Conference on Pattern Recognition, 2000.\n\n[8] Bimbot, F., Magrin-Chagnolleau, I. and Mathan, L., “Second-order statistical measures for text-independent speaker identification,” Speech Communication, vol. 17, pp. 177–192, 1995.\n\n[9] Stern, R. M., “Specification of the 1996 HUB4 Broadcast News Evaluation,” in DARPA Speech Recognition Workshop, 1997.\n\n[10] “The KING Speech Database,” http://www.ldc.upenn.edu/Catalog/docs/LDC95S22/kingdb.txt.\n\n[11] Chen, K., “Towards better making a decision in speaker verification,” Pattern Recognition, no. 36, pp. 329–246, 2003.\n\n[12] “Corel stock photos,” http://elib.cs.berleley.edu/photos/blobworld/cdlist.html.\n", "award": [], "sourceid": 2351, "authors": [{"given_name": "Pedro", "family_name": "Moreno", "institution": null}, {"given_name": "Purdy", "family_name": "Ho", "institution": null}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": null}]}