{"title": "Family Discovery", "book": "Advances in Neural Information Processing Systems", "page_first": 402, "page_last": 408, "abstract": null, "full_text": "Family Discovery \n\nStephen M. Omohundro \n\nNEC Research Institute \n\n4 Independence Way, Princeton, N J 08540 \n\nom@research.nj.nec.com \n\nAbstract \n\n\"Family discovery\" is the task of learning the dimension and struc(cid:173)\nture of a parameterized family of stochastic models. It is espe(cid:173)\ncially appropriate when the training examples are partitioned into \n\"episodes\" of samples drawn from a single parameter value. We \npresent three family discovery algorithms based on surface learn(cid:173)\ning and show that they significantly improve performance over two \nalternatives on a parameterized classification task. \n\n1 \n\nINTRODUCTION \n\nHuman listeners improve their ability to recognize speech by identifying the accent \nof the speaker. \"Might\" in an American accent is similar to \"mate\" in an Australian \naccent. By first identifying the accent, discrimination between these two words is \nimproved. We can imagine locating a speaker in a \"space of accents\" parameterized \nby features like pitch, vowel formants, \"r\" -strength, etc. This paper considers the \ntask of learning such parameterized models from data. \n\nMost speech recognition systems train hidden Markov models on labelled speech \ndata. Speaker-dependent systems train on speech from a single speaker. Speaker(cid:173)\nindependent systems are usually similar, but are trained on speech from many \ndifferent speakers in the hope that they will then recognize them all. This kind of \ntraining ignores speaker identity and is likely to result in confusion between pairs of \nwords which are given the same pronunciation by speakers with different accents. \n\nSpeaker-independent recognition systems could more closely mimic the human ap(cid:173)\nproach by using a learning paradigm we call \"family discovery\". 
The system would be trained on speech data partitioned into \"episodes\" for each speaker. From this data, the system would construct a parameterized family of models representing different accents. The learning algorithms presented in this paper could determine the dimension and structure of the parameterization. Given a sample of new speech, the best-fitting accent model would be used for recognition. \n\n[Figure 1: The structure of the three family discovery algorithms. Panels: Affine Family, Affine Patch Family, Coupled Map Family.] \n\nThe same paradigm applies to many other recognition tasks. For example, an OCR system could learn a parameterized family of font models (Revow et al., 1994). Given new text, the system would identify the document's font parameters and use the corresponding character recognizer. \n\nIn general, we use \"family discovery\" to refer to the task of learning the dimension and structure of a parameterized family of stochastic models. The methods we present are equally applicable to parameterized density estimation, classification, regression, manifold learning, reinforcement learning, clustering, stochastic grammar learning, and other stochastic settings. Here we only discuss classification and primarily consider training examples which are explicitly partitioned into episodes. \n\nThis approach fits naturally into the neural network literature on \"meta-learning\" (Schmidhuber, 1995) and \"network transfer\" (Pratt, 1994). It may also be considered as a particular case of the \"bias learning\" framework proposed by Baxter at this conference (Baxter, 1995). \n\nThere are two primary alternatives to family discovery: 1) fit a single model to the data from all episodes, or 2) use separate models for each episode. The first approach ignores the information that the different training sets came from distinct models. 
The second approach eliminates the possibility of inductive generalization from one episode to another. \n\nIn Section 2, we present three algorithms for family discovery based on techniques for \"surface learning\" (Bregler and Omohundro, 1994 and 1995). As shown in Figure 1, the three alternative representations of the family are: 1) a single affine subspace of the parameter space, 2) a set of local affine patches smoothly blended together, and 3) a pair of coupled maps from the parameter space into the model space and back. In Section 3, we compare these three approaches to the two alternatives on a parameterized classification task. \n\n2 THE FIVE ALGORITHMS \n\nLet the space of all classifiers under consideration be parameterized by θ and assume that different values of θ correspond to different classifiers (i.e., the parameterization is identifiable). For example, θ might represent the means, covariances, and class priors of a classifier with normal class-conditional densities. θ-space will typically have a much higher dimension than the parameterized family we are seeking. We write P_θ(x) for the total probability that the classifier θ assigns to a labelled or unlabelled example x. \n\nThe true models are drawn from a d-dimensional family parameterized by γ. Let the training set be partitioned into N episodes where episode i consists of N_i training examples t_ij, 1 ≤ j ≤ N_i, drawn from a single underlying model with parameter θ_i*. A family discovery learning algorithm uses this training data to estimate the underlying parameterized family. \n\nFrom a parameterized family, we may define the projection operator P from θ-space to itself which takes each θ to the closest member of the family. Using this projection operator, we may define a \"family prior\" on θ-space which dies off exponentially with the squared distance of a model from the family: m_P(θ) ∝ e^{-(θ - P(θ))²}. 
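In code, the family prior just defined might look like the following sketch, where project stands for the projection operator P (a hypothetical callable supplied by one of the family representations described later):

```python
import numpy as np

# Unnormalized family prior m_P(theta) ~ exp(-||theta - P(theta)||^2).
# `project` is the projection operator P: it maps a model parameter
# vector to the closest member of the family (hypothetical callable).
def family_prior(theta, project):
    theta = np.asarray(theta, dtype=float)
    residual = theta - np.asarray(project(theta), dtype=float)
    return np.exp(-residual @ residual)
```

A model lying exactly on the family projects to itself and so receives the maximal prior value 1; the prior decays with the squared distance from the family.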
Each of the family discovery algorithms chooses a family so as to maximize the posterior probability of the training data with respect to this prior. If the data is very sparse, this MAP approximation to a full Bayesian solution can be supplemented by \"Occam\" terms (MacKay, 1995) or by using a Monte Carlo approximation. \n\nThe outer loop of each of the algorithms optimizes the fit to the data by re-estimation in a manner similar to the Expectation Maximization (EM) approach (Jordan and Jacobs, 1994). First, the training data in each episode i is independently fit by a model θ_i. Then the dimension of the family is determined as described later, and the family projection operator P is chosen to maximize the probability ∏_i m_P(θ_i) that the episode models θ_i came from that family. The episode models θ_i are then re-estimated including the new prior probability m_P. These newly re-estimated models are influenced by the other episodes through m_P and so exhibit training set \"transfer\". The re-estimation loop is repeated until nothing changes. \n\nThe learned family can then be used to classify a set of N_test unlabelled test examples x_k, 1 ≤ k ≤ N_test, drawn from a model θ*_test in the family. First, the parameter θ_test is estimated by selecting the member of the family with the highest likelihood on the test samples. This model is then used to perform the classification. A good approximation to the best-fit family member is often the image, under the projection operator P, of the best-fit model in the entire θ-space. \n\nIn the next five sections, we describe the two alternative approaches and the three family discovery algorithms. They differ only in their choice of family representation as encoded in the projection operator P. \n\n2.1 The Single Model Approach \n\nThe first alternative approach is to train a single model on all of the training data. 
It selects θ to maximize the total likelihood L(θ) = ∏_{i=1}^{N} ∏_{j=1}^{N_i} P_θ(t_ij). New test data is classified by this single selected model. \n\n2.2 The Separate Models Approach \n\nThe second alternative approach fits separate models for each training episode. It chooses θ_i for 1 ≤ i ≤ N to maximize the episode likelihood L_i(θ_i) = ∏_{j=1}^{N_i} P_{θ_i}(t_ij). Given new test data, it determines which of the individual models θ_i fits best and classifies the data with it. \n\n2.3 The Affine Algorithm \n\nThe affine model represents the underlying model family as an affine subspace of the model parameter space. The projection operator P_affine projects a parameter vector θ orthogonally onto the affine subspace. The subspace is determined by selecting the top principal vectors in a principal components analysis of the best-fit episode model parameters. As described in (Bregler and Omohundro, 1994), the dimension is chosen by looking for a gap in the principal values. \n\n2.4 The Affine Patch Algorithm \n\nThe second family discovery algorithm is based on the \"surface learning\" procedure described in (Bregler and Omohundro, 1994). The family is represented by a collection of local affine patches which are blended together using Gaussian influence functions. The projection mapping P_patch is a smooth convex combination of projections onto the affine patches, P_patch(θ) = ∑_{α=1}^{k} I_α(θ) A_α(θ), where A_α is the projection operator for an affine patch and I_α(θ) = G_α(θ) / ∑_β G_β(θ) is a normalized Gaussian blending function. \n\nThe patches are initialized using k-means clustering on the episode models to choose k patch centers. A local principal components analysis is performed on the episode models which are closest to each center. The family dimension is determined by examining how the principal values scale as successive nearest neighbors are considered. 
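A minimal sketch of the PCA-based affine fit of Section 2.3, with the dimension chosen by an illustrative gap rule on the principal values (the patch initialization just described applies the same idea locally around each center). The gap_ratio threshold and all names are assumptions for illustration, not values from the text:

```python
import numpy as np

# Fit an affine subspace to the best-fit episode model parameters by PCA.
# The dimension d is set at the first large gap in the principal values
# (an illustrative stand-in for the gap heuristic in the text).
# Returns the projection operator P_affine and the chosen dimension.
def fit_affine_family(episode_models, gap_ratio=5.0):
    X = np.asarray(episode_models, dtype=float)
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    d = len(s)
    for i in range(len(s) - 1):
        if s[i] > gap_ratio * max(s[i + 1], 1e-12):
            d = i + 1  # gap found: keep the top i+1 principal vectors
            break
    B = Vt[:d]  # orthonormal basis of the affine subspace

    def project(theta):
        c = np.asarray(theta, dtype=float) - mean
        return mean + B.T @ (B @ c)  # orthogonal projection onto the subspace

    return project, d
```

Episode models lying near a low-dimensional subspace yield a sharp drop in the principal values, so the gap rule recovers the family dimension.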
Each patch may be thought of as a \"pancake\" lying in the surface. Dimensions which lie along the surface grow quickly as more neighbors are considered, while dimensions across the surface grow only because of the curvature of the surface. \n\nThe Gaussian influence functions and the affine patches are then updated by the EM algorithm (Jordan and Jacobs, 1994). With the affine patches held fixed, the Gaussians G_α are refit to the errors each patch makes in approximating the episode models. Then, with the Gaussians held fixed, the affine patches A_α are refit to the episode models weighted by the corresponding Gaussian G_α. Similar patches may be merged to form a more parsimonious model. \n\n2.5 The Coupled Map Algorithm \n\nThe affine patch approach has the virtue that it can represent topologically complex families (e.g., families representing physical objects might naturally be parameterized by the rotation group, which is topologically real projective 3-space). It cannot, however, provide an explicit parameterization of the family, which is useful in some applications (e.g., optimization searches). The third family discovery algorithm therefore attempts to directly learn a parameterization of the model family. \n\nRecall that the model parameters define θ-space, while the family parameters define γ-space. We represent a family by a mapping G from θ-space to γ-space together with a mapping F from γ-space back to θ-space. The projection operation is P_map(θ) = F(G(θ)). The map G(θ) defines the family parameter γ on the full θ-space. \n\nThis representation is similar to an \"auto-associator\" network in which we attempt to \"encode\" the best-fit episode parameters θ_i in the lower-dimensional γ-space by the mapping G, in such a way that they can be correctly reconstructed by the function F. 
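A minimal sketch of this coupled-map representation, in the non-iterative variant adopted in the experiments later in this section: G is the global PCA projection of the episode parameters, and a single least-squares affine F stands in (as an illustrative simplification) for the mixture-of-experts mapping described below. All names are assumptions:

```python
import numpy as np

# Coupled maps: G maps theta-space to gamma-space, F maps gamma-space
# back, and the family projection is P_map(theta) = F(G(theta)).
def fit_coupled_maps(episode_models, d):
    X = np.asarray(episode_models, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    B = Vt[:d]  # top d principal vectors define G

    def G(theta):
        return B @ (np.asarray(theta, dtype=float) - mean)

    # Fit F by least squares to send gamma_i = G(theta_i) back to theta_i.
    gammas = np.array([G(x) for x in X])
    A = np.hstack([gammas, np.ones((len(X), 1))])  # affine design matrix
    W, *_ = np.linalg.lstsq(A, X, rcond=None)

    def F(gamma):
        return np.append(np.asarray(gamma, dtype=float), 1.0) @ W

    return G, F
```

With this choice, F(G(θ)) projects a model onto the learned family, and G supplies the explicit family parameter γ that the patch representation lacks.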
Unfortunately, if we try to train F and G using back-propagation on the identity error function, we get no training data away from the family. There is no reason for G to project points away from the family to the closest family member. We can rectify this by training F and G iteratively. First an arbitrary G is chosen and F is trained to send the images γ_i = G(θ_i) back to θ_i. G is then trained, however, on images under F corrupted by additive spherical Gaussian noise. This provides samples away from the family, and on average the training signal sends each point in θ-space to the closest family member. \n\nTo avoid iterative training, our experiments used a simpler approach. G was taken to be the affine projection operator defined by a global principal components analysis of the best-fit episode model parameters. Once G is defined, F is chosen to minimize the difference between F(G(θ_i)) and θ_i for each best-fit episode parameter θ_i. \n\nAny form of trainable nonlinear mapping could be used for F (e.g., backprop neural networks or radial basis function networks). We represent F as a mixture of experts (Jordan and Jacobs, 1994) where each expert is an affine mapping and the mixture coefficients are Gaussians. The mapping is trained by the EM algorithm. \n\n3 ALGORITHM COMPARISON \n\nTo compare these five algorithms, we consider a two-class classification task with unit-variance normal class-conditional distributions on a 5-dimensional feature space. The means of the class distributions are parameterized by a nonlinear two-parameter family: \n\nm_1 = (γ_1 + (1/2) cos φ) ê_1 + (γ_2 + (1/2) sin φ) ê_2 \nm_2 = (γ_1 - (1/2) cos φ) ê_1 + (γ_2 - (1/2) sin φ) ê_2, \n\nwhere 0 ≤ γ_1, γ_2 ≤ 10 and φ = (γ_1 + γ_2)/3. The class means are kept at unit distance apart, ensuring significant class overlap over the whole family. 
The angle φ varies with the parameters so that the correct classification boundary changes orientation over the family. This choice of parameters introduces sufficient nonlinearity in the task to distinguish the nonlinear algorithms from the linear one. \n\nFigure 2 shows the comparative performance of the 5 algorithms. The x-axis is the total number x of training examples. Each set of examples consisted of approximately N = √x episodes of approximately N_i = √x examples each. The classifier parameters for an episode were drawn uniformly from the classifier family. The episode training examples were then sampled from the chosen classifier according to the classifier's distribution. Each of the 5 algorithms was then trained on these examples. The number of patches in the surface patch algorithm and the number of affine components in the surface map algorithm were both taken to be the square root of the number of training episodes. \n\n[Figure 2: A comparison of the 5 family discovery algorithms on the classification task. x-axis: number of examples (400-2000); y-axis: fraction of errors (0.34-0.52); curves: Single model, Separate models, Affine family, Affine Patch family, Map Mixture family.] \n\nThe y-axis shows the fraction of errors for each algorithm on an independent test set. Each test set consisted of 50 episodes of 50 examples each. The algorithms were presented with unlabelled data and their classification predictions were then compared with the correct classification label. 
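The synthetic task just described can be sketched as a sampler. Treating ê_1 and ê_2 as the first two coordinate axes of the 5-dimensional feature space, and all function names, are assumptions for illustration:

```python
import numpy as np

# Sample one episode of the two-class task in Section 3: unit-variance
# normal class-conditional densities on a 5-d feature space, with class
# means unit distance apart along direction phi = (gamma1 + gamma2) / 3.
def sample_episode(gamma1, gamma2, n, rng):
    phi = (gamma1 + gamma2) / 3.0
    m1 = np.zeros(5)
    m2 = np.zeros(5)
    m1[:2] = [gamma1 + 0.5 * np.cos(phi), gamma2 + 0.5 * np.sin(phi)]
    m2[:2] = [gamma1 - 0.5 * np.cos(phi), gamma2 - 0.5 * np.sin(phi)]
    labels = rng.integers(0, 2, size=n)  # equal class priors
    means = np.where(labels[:, None] == 0, m1, m2)
    return means + rng.standard_normal((n, 5)), labels
```

Because the means differ by (cos φ, sin φ), they are always exactly unit distance apart, keeping the class overlap significant across the whole family.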
\n\nThe results show significant improvement from family discovery on this classification task. The single model approach performed significantly worse than any of the other approaches, especially for larger numbers of episodes (where family discovery becomes possible). The separate models approach improves with the number of episodes, but is nearly always bested by the approaches which take explicit account of the underlying parameterized family. Because of the nonlinearity in this task, the simple affine model performs more poorly than the two nonlinear methods. It is simple to implement, however, and may well be the method of choice when the family is not so nonlinear. From this data, there is not a clear winner between the surface patch and surface map approaches. \n\n4 TRAINING SET DISCOVERY \n\nThroughout this paper, we have assumed that the training set was partitioned into episodes by the teacher. Agents interacting with the world may not be given this explicit information. For example, a speech recognition system may not be told when it is conversing with a new speaker. Similarly, a character recognition system would probably not be given explicit information about font changes. Learners can sometimes use the data itself to detect these changes, however. In many situations there is a strong prior that successive events are likely to have come from a single model, with only occasional model changes. The EM algorithm is often used for segmenting unlabelled speech. It may be used in a similar manner to find the training set episode boundaries. First, a clustering algorithm is used to partition the training examples into episodes. A parameterized family is then fit to these episodes. The data is then repartitioned according to the similarity of the induced family parameters and the process is repeated until it converges. 
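A minimal sketch of this segmentation loop, simplified to fitting one mean per episode (standing in for the episode model θ_i) and repartitioning examples by best-fitting model; the contiguity prior and the family fit are omitted for brevity, and all names are assumptions:

```python
import numpy as np

# Alternate between fitting one model per episode and repartitioning
# the examples by which episode model fits them best.
def discover_episodes(examples, n_episodes, n_iters=10):
    X = np.asarray(examples, dtype=float)
    # Initial partition: contiguous blocks, reflecting the prior that
    # successive events usually come from a single model.
    assign = np.minimum(np.arange(len(X)) * n_episodes // len(X),
                        n_episodes - 1)
    for _ in range(n_iters):
        models = np.array([X[assign == k].mean(axis=0)
                           for k in range(n_episodes)])
        # Squared distance from each example to each episode model.
        dists = ((X[:, None, :] - models[None, :, :]) ** 2).sum(axis=2)
        new = dists.argmin(axis=1)
        if (new == assign).all():
            break  # converged: partition no longer changes
        assign = new
    return assign
```

In the full algorithm the per-episode fits would themselves be constrained by the learned family prior, so the repartitioning reflects similarity of the induced family parameters rather than raw distances.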
A similar approach may be applied when the model parameters vary slowly with time rather than occasionally jumping discontinuously. \n\nAcknowledgements \n\nI'd like to thank Chris Bregler for work on the affine patch approach to surface learning, Alexander Linden for suggesting coupled maps for surface learning, and Peter Blicher for discussions. \n\nReferences \n\nBaxter, J. (1995) Learning model bias. This volume. \n\nBregler, C. & Omohundro, S. (1994) Surface learning with applications to lipreading. In J. Cowan, G. Tesauro and J. Alspector (eds.), Advances in Neural Information Processing Systems 6, pp. 43-50. San Francisco, CA: Morgan Kaufmann Publishers. \n\nBregler, C. & Omohundro, S. (1995) Nonlinear image interpolation using manifold learning. In G. Tesauro, D. Touretzky and T. Leen (eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press. \n\nBregler, C. & Omohundro, S. (1995) Nonlinear manifold learning for visual speech recognition. In W. Grimson (ed.), Proceedings of the Fifth International Conference on Computer Vision. \n\nJordan, M. & Jacobs, R. (1994) Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214. \n\nMacKay, D. (1995) Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network, to appear. \n\nPratt, L. (1994) Experiments on the transfer of knowledge between neural networks. In S. Hanson, G. Drastal, and R. Rivest (eds.), Computational Learning Theory and Natural Learning Systems, Constraints and Prospects, pp. 523-560. Cambridge, MA: MIT Press. \n\nRevow, M., Williams, C. and Hinton, G. (1994) Using generative models for handwritten digit recognition. Technical report, University of Toronto. \n\nSchmidhuber, J. (1995) On learning how to learn learning strategies. 
Technical Report FKI-198-94, Fakultät für Informatik, Technische Universität München. \n", "award": [], "sourceid": 1050, "authors": [{"given_name": "Stephen", "family_name": "Omohundro", "institution": null}]}