{"title": "Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1678, "page_last": 1686, "abstract": "We consider the problem of using nearest neighbor methods to provide a conditional probability estimate, P(y|a), when the number of labels y is large and the labels share some underlying structure. We propose a method for learning error-correcting output codes (ECOCs) to model the similarity between labels within a nearest neighbor framework. The learned ECOCs and nearest neighbor information are used to provide conditional probability estimates. We apply these estimates to the problem of acoustic modeling for speech recognition. We demonstrate an absolute reduction in word error rate (WER) of 0.9% (a 2.5% relative reduction in WER) on a lecture recognition task over a state-of-the-art baseline GMM model.", "full_text": "Learning Label Embeddings for Nearest-Neighbor\nMulti-class Classi\ufb01cation with an Application to\n\nSpeech Recognition\n\nNatasha Singh-Miller\n\nCambridge, MA\n\nMichael Collins\n\nCambridge, MA\n\nMassachusetts Institute of Technology\n\nMassachusetts Institute of Technology\n\nnatashas@mit.edu\n\nmcollins@csail.mit.edu\n\nAbstract\n\nWe consider the problem of using nearest neighbor methods to provide a condi-\ntional probability estimate, P (y|a), when the number of labels y is large and the\nlabels share some underlying structure. We propose a method for learning label\nembeddings (similar to error-correcting output codes (ECOCs)) to model the sim-\nilarity between labels within a nearest neighbor framework. The learned ECOCs\nand nearest neighbor information are used to provide conditional probability esti-\nmates. We apply these estimates to the problem of acoustic modeling for speech\nrecognition. 
We demonstrate signi\ufb01cant improvements in terms of word error rate\n(WER) on a lecture recognition task over a state-of-the-art baseline GMM model.\n\n1 Introduction\n\nRecent work has focused on the learning of similarity metrics within the context of nearest-neighbor\n(NN) classi\ufb01cation [7, 8, 12, 15]. These approaches learn an embedding (for example a linear\nprojection) of input points, and give signi\ufb01cant improvements in the performance of NN classi\ufb01ers.\n\nIn this paper we focus on the application of NN methods to multi-class problems, where the number\nof possible labels is large, and where there is signi\ufb01cant structure within the space of possible labels.\nWe describe an approach that induces prototype vectors My \u2208 \u211cL (similar to error-correcting\noutput codes (ECOCs)) for each label y, from a set of training examples {(ai, yi)} for i = 1 . . . N.\nThe prototype vectors are embedded within a NN model that estimates P (y|a); the vectors are\nlearned using a leave-one-out estimate of conditional log-likelihood (CLL) derived from the training\nexamples. The end result is a method that embeds labels y into \u211cL in a way that signi\ufb01cantly\nimproves conditional log-likelihood estimates for multi-class problems under a NN classi\ufb01er.\n\nThe application we focus on is acoustic modeling for speech recognition, where each input a \u2208 \u211cD\nis a vector of measured acoustic features, and each label y \u2208 Y is an acoustic-phonetic label. As\nis common in speech recognition applications, the size of the label space Y is large (in our ex-\nperiments we have 1871 possible labels), and there is signi\ufb01cant structure within the labels: many\nacoustic-phonetic labels are highly correlated or confusable, and many share underlying phonolog-\nical features. 
We describe experiments measuring both conditional log-likelihood of test data, and word error rates when the method is incorporated within a full speech recognizer. In both settings the experiments show significant improvements for the ECOC method over both baseline NN methods (e.g., the method of [8]), as well as Gaussian mixture models (GMMs), as conventionally used in speech recognition systems.\n\nWhile our experiments are on speech recognition, the method should be relevant to other domains which involve large multi-class problems with structured labels\u2014for example problems in natural language processing, or in computer vision (e.g., see [14] for a recent use of neighborhood components analysis (NCA) [8] within an object-recognition task with a very large number of object labels). We note also that the approach is relatively efficient: our model is trained on around 11 million training examples.\n\n2 Related Work\n\nSeveral pieces of recent work have considered the learning of feature space embeddings with the goal of optimizing the performance of nearest-neighbor classifiers [7, 8, 12, 15]. We make use of the formalism of [8] as the starting point in our work. The central contrast between our work and this previous work is that we learn an embedding of the labels in a multi-class problem; as we will see, this gives significant improvements in performance when nearest-neighbor methods are applied to multi-class problems arising in the context of speech recognition.\n\nOur work is related to previous work on error-correcting output codes for multi-class problems. [1, 2, 4, 9] describe error-correcting output codes; more recently [2, 3, 11] have described algorithms for learning ECOCs. Our work differs from previous work in that ECOC codes are learned within a nearest-neighbor framework. 
Also, we learn the ECOC codes in order to model the underlying structure of the label space and not specifically to combine the results of multiple classifiers.\n\n3 Background\n\nThe goal of our work is to derive a model that estimates P(y|a), where a \u2208 \u211cD is a feature vector representing some input, and y is a label drawn from a set of possible labels Y. The parameters of our model are estimated using training examples {(a1, y1), ..., (aN, yN)}. In general the training criterion will be closely related to the conditional log-likelihood of the training points:\n\n\u2211_{i=1}^{N} log P(yi|ai)\n\nWe choose to optimize the log-likelihood rather than simple classification error, because these estimates will be applied within a larger system, in our case a speech recognizer, where the probabilities will be propagated throughout the recognition model; hence it is important for the model to provide well-calibrated probability estimates.\n\nFor the speech recognition application considered in this paper, Y consists of 1871 acoustic-phonetic classes that may be highly correlated with one another. Leveraging structure in the label space will be crucial to providing good estimates of P(y|a); we would like to learn the inherent structure of the label space automatically. Note in addition that efficiency is important within the speech recognition application: in our experiments we make use of around 11 million training samples, while the dimensionality of the data is D = 50.\n\nIn particular, we will develop nearest-neighbor methods that give an efficient estimate of P(y|a). As a first baseline approach\u2014and as a starting point for the methods we develop\u2014consider the neighborhood components analysis (NCA) method introduced by [8]. 
In NCA, for any test point a, a distribution \u03b1(j|a) over the training examples is defined as follows, where \u03b1(j|a) decreases rapidly as the distance between a and aj increases:\n\n\u03b1(j|a) = e^{\u2212||a \u2212 aj||^2} / \u2211_{m=1}^{N} e^{\u2212||a \u2212 am||^2}    (1)\n\nThe estimate of P(y|a) is then defined as follows:\n\nPnca(y|a) = \u2211_{i=1,...,N : yi=y} \u03b1(i|a)    (2)\n\nIn NCA the original training data consists of points (xi, yi) for i = 1 . . . N, where xi \u2208 \u211cD\u2032, with D\u2032 typically larger than D. The method learns a projection matrix A that defines the modified representation ai = Axi (the same transformation is applied to test points). The matrix A is learned from training examples, to optimize log-likelihood under the model in Eq. 2.\n\nIn our experiments we assume that a = Ax for some underlying representation x and a projection matrix A that has been learned using NCA to optimize the log-likelihood of the training set. As a result the matrix A, and consequently the representation a, are well-calibrated in terms of using nearest neighbors to estimate P(y|a) through Eq. 2. A first baseline method for our problem is therefore to directly use the estimates defined by Eq. 2.\n\nWe will see, however, that this baseline method performs poorly at providing estimates of P(y|a) within the speech recognition application. Importantly, the model fails to exploit the underlying structure or correlations within the label space. For example, consider a test point that has many neighbors with the phonemic label /s/. 
This should be evidence that closely related phonemes, /sh/ for instance, should also get a relatively high probability under the model, but the model is unable to capture this effect.\n\nAs a second baseline, an alternative method for estimating P(y|a) using nearest neighbor information is the following:\n\nPk(y|a) = (# of k-nearest neighbors of a in the training set with label y) / k\n\nHere the choice of k is crucial. A small k will be very sensitive to noise and will necessarily lead to many classes receiving a probability of zero, which is undesirable for our application. On the other hand, if k is too large, samples from far outside the neighborhood of a will influence Pk(y|a). We will describe a baseline method that interpolates estimates from several different values of k. This baseline will be useful with our approach, but again suffers from the fact that it does not model the underlying structure of the label space.\n\n4 Error-Correcting Output Codes for Nearest-Neighbor Classifiers\n\nWe now describe a model that uses error-correcting output codes to explicitly represent and learn the underlying structure of the label space Y. For each label y, we define My \u2208 \u211cL to be a prototype vector. We assume that the inner product \u27e8My, Mz\u27e9 will in some sense represent the similarity between labels y and z. The vectors My will be learned automatically, effectively representing an embedding of the labels in \u211cL. In this section we first describe the structure of the model, and then describe a method for training the parameters of the model (i.e., learning the prototype vectors My).\n\n4.1 ECOC Model\n\nThe ECOC model is defined as follows. When considering a test sample a, we first assign weights \u03b1(j|a) to points aj from the training set through the NCA definition in Eq. 1. Let M be a matrix that contains all the prototype vectors My as its rows. 
We can then construct a vector H(a; M) that uses the weights \u03b1(j|a) and the true labels of the training samples to calculate the expected value of the output code representing a:\n\nH(a; M) = \u2211_{j=1}^{N} \u03b1(j|a) M_{yj}\n\nGiven this definition of H(a; M), our estimate under the ECOC model is defined as follows:\n\nPecoc(y|a; M) = e^{\u27e8My, H(a;M)\u27e9} / \u2211_{y\u2032\u2208Y} e^{\u27e8My\u2032, H(a;M)\u27e9}\n\nL     average CLL\n2     -4.388\n10    -2.748\n20    -2.580\n30    -2.454\n40    -2.432\n50    -2.470\n60    -2.481\n\nTable 1: Average CLL achieved by Pecoc over DevSet1 for different values of L.\n\nThis distribution assigns most of the probability for a sample vector a to classes whose prototype vectors have a large inner product with H(a; M). All labels receive a non-zero weight under Pecoc(y|a; M).\n\n4.2 Training the ECOC Model\n\nWe now describe a method for estimating the ECOC vectors My in the model. As in [8] the method uses a leave-one-out optimization criterion, which is particularly convenient within nearest-neighbor approaches. The optimization problem will be to maximize the conditional log-likelihood function\n\nF(M) = \u2211_{i=1}^{N} log Pecoc^(loo)(yi|ai; M)\n\nwhere Pecoc^(loo)(yi|ai; M) is a leave-one-out estimate of the probability of label yi given the input ai, assuming an ECOC matrix M. 
This criterion is related to the classification performance on the training data and also discourages the assignment of very low probability to the correct class.\n\nThe estimate Pecoc^(loo)(yi|ai; M) is given through the following definitions:\n\n\u03b1(loo)(j|i) = e^{\u2212||ai \u2212 aj||^2} / \u2211_{m=1, m\u2260i}^{N} e^{\u2212||ai \u2212 am||^2} if i \u2260 j, and 0 otherwise\n\nH(loo)(ai; M) = \u2211_{j=1}^{N} \u03b1(loo)(j|i) M_{yj}\n\nPecoc^(loo)(y|ai; M) = e^{\u27e8My, H(loo)(ai;M)\u27e9} / \u2211_{y\u2032\u2208Y} e^{\u27e8My\u2032, H(loo)(ai;M)\u27e9}\n\nThe criterion F(M) can be optimized using gradient-ascent methods, where the gradient is as follows:\n\n\u2202F(M)/\u2202Mz = \u2207(z) \u2212 \u2207\u2032(z)\n\n\u2207(z) = \u2211_{i=1}^{N} \u2211_{j=1}^{N} \u03b1(loo)(j|i) (\u03b4_{z,yi} M_{yj} + \u03b4_{yj,z} M_{yi})\n\n\u2207\u2032(z) = \u2211_{i=1}^{N} \u2211_{y\u2032\u2208Y} Pecoc^(loo)(y\u2032|ai; M) (\u2211_{j=1}^{N} \u03b1(loo)(j|i) (\u03b4_{z,y\u2032} M_{yj} + \u03b4_{yj,z} M_{y\u2032}))\n\nModel    Average CLL on DevSet1    Perplexity\nPnca     -2.657                    14.25\nPnn      -2.535                    12.61\nPecoc    -2.432                    11.38\nPfull    -2.337                    10.35\nPgmm     -2.299                    9.96\nPmix     -2.165                    8.71\n\nTable 2: Average conditional log-likelihood (CLL) of Pnca, Pnn, Pecoc, Pfull, Pgmm and Pmix on DevSet1. The corresponding perplexity values are indicated as well, where the perplexity is defined as e^{\u2212x} given that x is the average CLL.\n\nHere \u03b4a,b = 1 if a = b and \u03b4a,b = 0 if a \u2260 b. Since \u03b1(loo)(j|i) will be very small if ||ai \u2212 aj||^2 is large, the gradient calculation can be truncated for such pairs of points, which significantly improves the efficiency of the method (a similar observation is used in [8]). This optimization is non-convex and it is possible to converge to a local optimum.\n\nIn our experiments we learn the matrix M using conjugate gradient ascent, though alternatives such as stochastic gradient can also be used. 
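To make the estimator concrete, the forward computation of Pecoc from section 4.1 (NCA weights, expected code H(a; M), softmax over inner products) can be sketched in NumPy. This is a minimal illustrative sketch, not the authors' implementation; function and variable names are assumptions, and the leave-one-out variant used in training simply excludes the held-out point from the weights.

```python
import numpy as np

def ecoc_probs(a, train_X, train_y, M):
    """Sketch of P_ecoc(y|a; M).

    train_X: (N, D) projected acoustic vectors; train_y: (N,) integer labels;
    M: (|Y|, L) matrix whose rows are the prototype vectors M_y."""
    # alpha(j|a) proportional to exp(-||a - a_j||^2), as in Eq. 1;
    # shifting by the minimum squared distance leaves alpha unchanged
    # but avoids underflow.
    d2 = np.sum((train_X - a) ** 2, axis=1)
    w = np.exp(-(d2 - d2.min()))
    alpha = w / w.sum()
    # H(a; M) = sum_j alpha(j|a) * M_{y_j}: expected output code of a.
    H = alpha @ M[train_y]
    # Softmax over the inner products <M_y, H(a; M)> gives P_ecoc(y|a; M).
    s = M @ H
    s = s - s.max()
    p = np.exp(s)
    return p / p.sum()
```

With one-hot prototype rows (M = I) this reduces to a softmax of the NCA posterior, which is a convenient sanity check before learning M.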
A random initialization of M is used for each experiment. We select L = 40 as the length of the prototype vectors My. We experimented with different values of L: the average conditional log-likelihood achieved on a development set of approximately 115,000 samples (DevSet1) is listed in Table 1. The performance of the method improves initially as L increases, but the objective levels off around L = 40.\n\n5 Experiments on Log-Likelihood\n\nWe test our approach on a large-vocabulary lecture recognition task [6]. This is a challenging task that consists of recognizing college lectures given by multiple speakers. We use the SUMMIT recognizer [5], which makes use of 1871 distinct class labels. The acoustic vectors we use are 112-dimensional vectors consisting of eight concatenated 14-dimensional vectors of MFCC measurements. These vectors are projected down to 50 dimensions using NCA as described in [13]. This section describes experiments comparing the ECOC model to several baseline models in terms of their performance on the conditional log-likelihood of sample acoustic vectors.\n\nThe baseline model, Pnn, makes use of estimates Pk(y|a) as defined in section 3. The set K is a set of integers representing different values for k, the number of nearest neighbors used to evaluate Pk. Additionally, we assume d functions over the labels, P1(y), ..., Pd(y). (More information on the functions Pj(y) that we use in our experiments can be found in the appendix. We have found these functions over the labels are useful within our speech recognition application.) The model is then defined as\n\nPnn(y|a; \u00af\u03bb) = \u2211_{k\u2208K} \u03bbk Pk(y|a) + \u2211_{j=1}^{d} \u03bb\u2032j Pj(y)\n\nwhere \u03bbk \u2265 0 for all k \u2208 K, \u03bb\u2032j \u2265 0 for j = 1, ..., d, and \u2211_{k\u2208K} \u03bbk + \u2211_{j=1}^{d} \u03bb\u2032j = 1. The \u00af\u03bb values were estimated using the EM algorithm on a validation set of examples (DevSet2). 
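The interpolated nearest-neighbor estimate defined above can be sketched as follows. This is an illustrative sketch under stated assumptions: the mixture weights are fixed constants here rather than fitted with EM as in the paper, and the prior terms Pj(y) are omitted for brevity.

```python
import numpy as np

def p_knn(a, train_X, train_y, num_labels, ks, lambdas):
    """Interpolated k-NN label distribution: sum_k lambda_k * Pk(y|a),
    where Pk(y|a) = (# of the k nearest neighbors of a with label y) / k."""
    # Rank training points by squared Euclidean distance to a.
    order = np.argsort(np.sum((train_X - a) ** 2, axis=1))
    p = np.zeros(num_labels)
    for k, lam in zip(ks, lambdas):
        # Label histogram over the k nearest neighbors.
        counts = np.bincount(train_y[order[:k]], minlength=num_labels)
        p += lam * counts / k
    return p
```

Provided the lambdas sum to one, the result is a proper distribution; small k gives sharp but noisy estimates and large k smooth but diffuse ones, which is exactly why the interpolation helps.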
In our experiments, we select K = {5, 10, 20, 30, 50, 100, 250, 500, 1000}. Table 2 contains the average conditional log-likelihood achieved on a development set (DevSet1) by Pnca, Pnn and Pecoc. These results show that Pecoc clearly outperforms these two baseline models.\n\nIn a second experiment we combined Pecoc with Pnn to create a third model Pfull(y|a). This model includes information from the nearest neighbors, the output codes, as well as the distributions over the label space. The model takes the following form:\n\nPfull(y|a; \u00af\u03bb) = \u2211_{k\u2208K} \u03bbk Pk(y|a) + \u2211_{j=1}^{d} \u03bb\u2032j Pj(y) + \u03bbecoc Pecoc(y|a; M)\n\nAcoustic Model     WER (DevSet3)    WER (Test Set)\nBaseline Model     36.3             35.4\nAugmented Model    35.2             34.5\n\nTable 3: WER of the recognizer for different acoustic models on the development and test set.\n\nThe values of \u00af\u03bb here have similar constraints as before and are again optimized using the EM algorithm. Results in Table 2 show that this model gives a further clear improvement over Pecoc.\n\nWe also compare ECOC to a GMM model, as conventionally used in speech recognition systems. The GMM we use is trained using state-of-the-art algorithms with the SUMMIT system [5]. The GMM defines a generative model Pgmm(a|y); we derive a conditional model as follows:\n\nPgmm(y|a) = Pgmm(a|y)^\u03b1 P(y) / \u2211_{y\u2032\u2208Y} Pgmm(a|y\u2032)^\u03b1 P(y\u2032)\n\nThe parameter \u03b1 is selected experimentally to achieve maximum CLL on DevSet2, and P(y) refers to the prior over the labels calculated directly from their relative proportions in the training set. Table 2 shows that Pfull and Pgmm are close in performance, with Pgmm giving slightly improved results. 
A final interpolated model, with similar constraints on the values of \u00af\u03bb and trained using the EM algorithm, is as follows:\n\nPmix(y|a; \u00af\u03bb) = \u2211_{k\u2208K} \u03bbk Pk(y|a) + \u2211_{j=1}^{d} \u03bb\u2032j Pj(y) + \u03bbecoc Pecoc(y|a; M) + \u03bbgmm Pgmm(y|a)\n\nResults for Pmix are shown in the final row of the table. This interpolated model gives a clear improvement over both the GMM and ECOC models alone. Thus the ECOC model, combined with additional nearest-neighbor information, can give a clear improvement over state-of-the-art GMMs on this task.\n\n6 Recognition Experiments\n\nIn this section we describe experiments that integrate the ECOC model within a full speech recognition system. We learn parameters \u00af\u03bb for Pfull(y|a) using both DevSet1 and DevSet2. However, we need to derive an estimate for P(a|y) for use by the recognizer. We can do so by using an estimate for P(a|y) proportional to P(y|a)/P(y) [16]. The estimates for P(y) are derived directly from the proportions of occurrences of each acoustic-phonetic class in the training set.\n\nIn our experiments we consider the following two methods for calculating the acoustic model.\n\n\u2022 Baseline Model: \u03b21 log Pgmm(a|y)\n\n\u2022 Augmented Model: \u03b22 log[(\u03b3 Pgmm(y|a) + (1 \u2212 \u03b3) Pfull(y|a)) / P(y)]\n\nThe baseline method is just a GMM model with the commonly used scaling parameter \u03b21. The augmented model combines Pgmm linearly with Pfull using parameter \u03b3, and the log of the combination is scaled by parameter \u03b22. The parameters \u03b21, \u03b22, and \u03b3 are selected using the downhill simplex algorithm by optimizing WER over a development set [10]. Our development set (DevSet3) consists of eight hours of data including six speakers, and our test set consists of eight hours of data including five speakers. 
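The two acoustic scores described above can be written out as small helper functions. This is a hedged sketch only: the function names and scalar-per-label calling convention are assumptions for illustration, not the SUMMIT recognizer's interface.

```python
import math

def baseline_score(p_gmm_a_given_y, beta1):
    """Baseline acoustic score: beta1 * log Pgmm(a|y)."""
    return beta1 * math.log(p_gmm_a_given_y)

def augmented_score(p_gmm_y_given_a, p_full_y_given_a, p_y, beta2, gamma):
    """Augmented score: beta2 * log((gamma*Pgmm(y|a) + (1-gamma)*Pfull(y|a)) / P(y)).

    Dividing the interpolated posterior by the prior P(y) yields a quantity
    proportional to P(a|y), which is what the recognizer consumes."""
    mix = gamma * p_gmm_y_given_a + (1.0 - gamma) * p_full_y_given_a
    return beta2 * math.log(mix / p_y)
```

In the paper the scalars beta1, beta2 and gamma are tuned by downhill simplex on development-set WER rather than set by hand.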
Results for both methods on the development set and test set are presented in Table 3.\n\nThe augmented model outperforms the baseline GMM model. This indicates that the nearest neighbor information, along with the ECOC embedding, can significantly improve the acoustic model. Overall, an absolute reduction of 1.1% in WER on the development set and 0.9% on the test set is achieved using the augmented acoustic model. These results are significant with p < 0.001 using the sign test calculated at the utterance level.\n\nFigure 1: Plot of 2-dimensional output codes corresponding to 73 acoustic-phonetic classes. The red circles indicate noise and silence classes. The phonemic classes are divided as follows: vowels, semivowels, nasals, stops and stop closures, fricatives, affricates, and the aspirant /hh/.\n\n7 Discussion\n\n7.1 Plot of a low-dimensional embedding\n\nIn order to get a sense of what is learned by the output codes of Pecoc we can plot the output codes directly. Figure 1 shows a plot of the output codes learned when L = 2. The output codes are learned for 1871 classes, but only 73 internal acoustic-phonetic classes are shown in the plot for clarity. In the plot, classes of similar acoustic-phonetic category are shown in the same color and shape. We can see that items of similar acoustic categories are grouped closely together. For example, the vowels are close to each other in the bottom left quadrant, while the stop-closures are grouped together in the top right, the affricates in the top left, and the nasals in the bottom right. 
The fricatives are a little more spread out, but each is usually grouped close to another fricative that shares some underlying phonological feature, such as /sh/ and /zh/, which are both palatal, and /f/ and /th/, which are both unvoiced. We can also see specific acoustic properties emerging. For example, the voiced stops /b/, /d/, /g/ are placed close to other voiced items of different acoustic categories.\n\n7.2 Extensions\n\nThe ECOC embedding of the label space could also be co-learned with an embedding of the input acoustic vector space by extending the approach of NCA [8]. It would simply require the reintroduction of the projection matrix A in the weights \u03b1:\n\n\u03b1(j|x) = e^{\u2212||Ax \u2212 Axj||^2} / \u2211_{m=1}^{N} e^{\u2212||Ax \u2212 Axm||^2}\n\nH(x; M) and Pecoc would still be defined as in section 4.1. The optimization criterion would now depend on both A and M. To optimize A, we could again use gradient methods. Co-learning the two embeddings M and A could potentially lead to further improvements.\n\n8 Conclusion\n\nWe have shown that nearest neighbor methods can be used to improve the performance of a GMM-based acoustic model and reduce the WER on a challenging speech recognition task. We have also developed a model that uses error-correcting output codes to represent an embedding of the acoustic-phonetic label space, which helps us capture cross-class information. Future work on this task could include co-learning an embedding of the input acoustic vector space with the ECOC matrix to attempt to achieve further gains.\n\nAppendix\n\nWe define three distributions based on the prior probabilities, P(y), of the acoustic-phonetic classes. The SUMMIT recognizer makes use of 1871 distinct acoustic-phonetic labels [5]. We divide the set of labels, Y, into three disjoint categories.\n\n\u2022 Y(1) includes labels involving internal phonemic events (e.g. 
/ay/)\n\n\u2022 Y(2) includes labels involving the transition from one acoustic-phonetic event to another (e.g. /ow/->/ch/)\n\n\u2022 Y(3) includes labels involving only non-phonetic events like noise and silence\n\nWe define a distribution P(1)(y) as follows; distributions P(2)(y) and P(3)(y) are defined similarly.\n\nP(1)(y) = P(y) / \u2211_{y\u2032\u2208Y(1)} P(y\u2032) if y \u2208 Y(1), and 0 otherwise\n\nReferences\n\n[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113\u2013141, 2000.\n\n[2] K. Crammer and Y. Singer. Improved output coding for classification using continuous relaxation. In Advances in Neural Information Processing Systems. MIT Press, 2000.\n\n[3] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2-3):201\u2013233, 2002.\n\n[4] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263\u2013286, 1995.\n\n[5] J. Glass. A probabilistic framework for segment-based speech recognition. Computer, Speech, and Language, 17(2-3):137\u2013152, 2003.\n\n[6] J. Glass, T. J. Hazen, L. Hetherington, and C. Wang. Analysis and processing of lecture audio data: Preliminary investigations. In HLT-NAACL 2004 Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pages 9\u201312, 2004.\n\n[7] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Scholkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 513\u2013520. MIT Press, 2006.\n\n[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. 
Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513\u2013520. MIT Press, 2005.\n\n[9] A. Klautau, N. Jevtic, and A. Orlitsky. On nearest-neighbor error-correcting output codes with application to all-pairs multiclass support vector machines. Journal of Machine Learning Research, 4:1\u201315, 2003.\n\n[10] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 3rd edition, 2007.\n\n[11] O. Pujol, P. Radeva, and J. Vitria. Discriminant ECOC: a heuristic method for application dependent design of error correcting output codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6), 2006.\n\n[12] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AI and Statistics, 2007.\n\n[13] N. Singh-Miller, M. Collins, and T. J. Hazen. Dimensionality reduction for speech recognition using neighborhood components analysis. In Interspeech, 2007.\n\n[14] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In IEEE Computer Vision and Pattern Recognition, June 2008.\n\n[15] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems. MIT Press, 2006.\n\n[16] G. Zavaliagkos, Y. Zhao, R. Schwartz, and J. Makhoul. A hybrid segmental neural net/hidden Markov model system for continuous speech recognition. IEEE Transactions on Speech and Audio Processing, 2(1):151\u2013160, 1994.\n", "award": [], "sourceid": 441, "authors": [{"given_name": "Natasha", "family_name": "Singh-miller", "institution": null}, {"given_name": "Michael", "family_name": "Collins", "institution": null}]}