{"title": "Shooting Craps in Search of an Optimal Strategy for Training Connectionist Pattern Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 1125, "page_last": 1132, "abstract": null, "full_text": "Shooting Craps in Search of an Optimal Strategy for Training Connectionist Pattern Classifiers\n\nJ. B. Hampshire II and B. V. K. Vijaya Kumar\nDepartment of Electrical & Computer Engineering\nCarnegie Mellon University\nPittsburgh, PA 15213-3890\nhamps@speech1.cs.cmu.edu and kumar@gauss.ece.cmu.edu\n\nAbstract\n\nWe compare two strategies for training connectionist (as well as non-connectionist) models for statistical pattern recognition. The probabilistic strategy is based on the notion that Bayesian discrimination (i.e., optimal classification) is achieved when the classifier learns the a posteriori class distributions of the random feature vector. The differential strategy is based on the notion that the identity of the largest class a posteriori probability of the feature vector is all that is needed to achieve Bayesian discrimination. Each strategy is directly linked to a family of objective functions that can be used in the supervised training procedure. We prove that the probabilistic strategy - linked with error measure objective functions such as mean-squared-error and cross-entropy, typically used to train classifiers - necessarily requires larger training sets and more complex classifier architectures than those needed to approximate the Bayesian discriminant function. In contrast, we prove that the differential strategy - linked with classification figure-of-merit objective functions (CFMmono) [3] - requires the minimum classifier functional complexity and the fewest training examples necessary to approximate the Bayesian discriminant function with specified precision (measured in probability of error).
We present our proofs in the context of a game of chance in which an unfair C-sided die is tossed repeatedly. We show that this rigged game of dice is a paradigm at the root of all statistical pattern recognition tasks, and demonstrate how a simple extension of the concept leads us to a general information-theoretic model of sample complexity for statistical pattern recognition.\n\n1 Introduction\n\nRecent work on creating connectionist pattern classifiers that generalize well to novel test data has focused on finding the network architecture with the minimum functional complexity necessary to model the training data accurately (see, for example, the works of Baum, Cover, Haussler, and Vapnik). Meanwhile, relatively little attention has been paid to the effect on generalization of the objective function used to train the classifier. In fact, the choice of objective function used to train the classifier is tantamount to a choice of training strategy, as described in the abstract [2, 3].\n\nWe formulate the proofs outlined in the abstract in the context of a rigged game of dice in which an unfair C-sided die is tossed repeatedly. Each face of the die has some probability of turning up. We assume that one face is always more likely than all the others. As a result, all the probabilities may be different, but at most C - 1 of them can be identical. The objective of the game is to identify the most likely die face with specified high confidence. The relationship between this rigged dice paradigm and statistical pattern recognition becomes clear if one realizes that a single unfair die is analogous to a specific point on the domain of the random feature vector being classified. Just as there are specific class probabilities associated with each point in feature vector space, each die has specific probabilities associated with each of its faces.
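To make the rigged-dice paradigm concrete, here is a minimal simulation sketch (ours, not the authors'; the function name is illustrative, and the die probabilities are those of the five-faced die used later in section 2.1):

```python
import random
from collections import Counter

def most_likely_face(probs, n, rng):
    """Toss an unfair die with face probabilities `probs` n times and
    return the index of the empirically most frequent face."""
    tosses = rng.choices(range(len(probs)), weights=probs, k=n)
    return Counter(tosses).most_common(1)[0][0]

# Face 0 is truly the most likely; confidence of identifying it grows with n.
die = [0.37, 0.28, 0.20, 0.10, 0.05]
for n in (10, 100, 1000):
    wins = sum(most_likely_face(die, n, random.Random(s)) == 0 for s in range(500))
    print(n, wins / 500)
```

The objective of the game - identifying face 0 with specified high confidence - then amounts to choosing the number of tosses n (and, for a classifier, the representation precision Mq) so that the empirical success rate meets the target.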
The number of faces on the die equals the number of classes associated with the analogous point in feature vector space. Identifying the most likely die face is equivalent to identifying the maximum class a posteriori probability for the analogous point in feature vector space - the requirement for Bayesian discrimination. We formulate our proofs for the case of a single die, and conclude by showing how a simple extension of the mathematics leads to general expressions for pattern recognition involving both discrete and continuous random feature vectors.\n\nAuthors' Note: In the interest of brevity, our proofs are posed as answers to questions that pertain to the rigged game of dice. It is hoped that the reader will find the relevance of each question/answer to statistical pattern recognition clear. Owing to page limitations, we cannot provide our proofs in full detail; the reader seeking such detail should refer to [1]. Definitions of symbols used in the following proofs are given in table 1.\n\n1.1 A Fixed-Point Representation\n\nThe Mq-bit approximation q_M[x] to the real number x in (-1, 1] is of the form\n\nMSB (most significant bit) = sign[x], MSB - 1 = 2^-1, ..., LSB (least significant bit) = 2^-(Mq-1), (1)\n\nwith the specific value defined as the mid-point of the 2^-(Mq-1)-wide interval in which x is located:\n\nq_M[x] = sign[x] (floor(|x| 2^(Mq-1)) 2^-(Mq-1) + 2^-Mq) for |x| < 1; q_M[x] = sign[x] (1 - 2^-Mq) for |x| = 1. (2)\n\nThe lower and upper bounds on the quantization interval are\n\nL_Mq[x] < x < U_Mq[x]. (3)\n\nTable 1: Definitions of symbols used to describe die faces, probabilities, probabilistic differences, and associated estimates (n denotes the sample size).\n\nw_rj : the true jth most likely die face (w-hat_rj is the estimated jth most likely face).
P(w_rj) : the probability of the true jth most likely die face.\n\nk_rj : the number of occurrences of the true jth most likely die face.\n\nP-hat(w_rj) : an empirical estimate of the probability of the true jth most likely die face: P-hat(w_rj) = k_rj / n.\n\nDelta_ri : the probabilistic difference involving the true rankings and probabilities of the C die faces: Delta_ri = P(w_ri) - sup_{j != i} P(w_rj). (4)\n\nDelta-hat_ri : the probabilistic difference involving the true rankings but empirically estimated probabilities of the C die faces: Delta-hat_ri = P-hat(w_ri) - sup_{j != i} P-hat(w_rj) = (k_ri - sup_{j != i} k_rj) / n. (5)\n\nThe fixed-point representation described by (1) - (5) differs from standard fixed-point representations in its choice of quantization interval. The choice of (2) - (5) represents zero as a negative - more precisely, a non-positive - finite-precision number. See [1] for the motivation of this format choice.\n\n1.2 A Mathematical Comparison of the Probabilistic and Differential Strategies\n\nThe probabilistic strategy for identifying the most likely face on a die with C faces involves estimating the C face probabilities. In order for us to distinguish P(w_r1) from P(w_r2), we must choose Mq (i.e., the number of bits in our fixed-point representation of the estimated probabilities) such that\n\nq_M[P-hat(w_r1)] - q_M[P-hat(w_r2)] >= 2^-(Mq-1). (6)\n\nThe distinction between the differential and probabilistic strategies is made more clear if one considers the way in which the Mq-bit approximation Delta-hat_r1 is computed from a random sample containing k_r1 occurrences of die face w_r1 and k_r2 occurrences of die face w_r2. For the differential strategy\n\nDelta-hat_r1,differential = q_M[(k_r1 - k_r2) / n], (7)\n\nand for the probabilistic strategy\n\nDelta-hat_r1,probabilistic = q_M[k_r1 / n] - q_M[k_r2 / n], (8)\n\nwhere\n\nDelta_i := P(w_i) - sup_{j != i} P(w_j), i = 1, 2, ...,
C. (9)\n\nNote that when i = r1,\n\nDelta_r1 = P(w_r1) - P(w_r2), (10)\n\nand when i != r1,\n\nDelta_i = P(w_i) - P(w_r1). (11)\n\nNote also\n\nsum_{i=1}^{C} Delta_i = sum_{j=3}^{C} P(w_rj) - (C - 2) P(w_r1). (12)\n\nSince\n\nP(w_r1) = (1/C) [1 - sum_{i != r1} Delta_i], (13)\n\nwe can show that the C differences of (9) yield the C probabilities by\n\nP(w_rj) = Delta_rj + P(w_r1) for all j > 1. (14)\n\nThus, estimating the C differences of (9) is equivalent to estimating the C probabilities P(w_1), P(w_2), ..., P(w_C).\n\nClearly, the sign of Delta-hat_r1 in (7) is modeled correctly (i.e., Delta-hat_r1,differential can correctly identify the most likely face) when Mq = 1, while this is typically not the case for Delta-hat_r1,probabilistic in (8). In the latter case, Delta-hat_r1,probabilistic is zero when Mq = 1 because q_M[P-hat(w_r1)] and q_M[P-hat(w_r2)] are indistinguishable for Mq below some minimal value implied by (6). That minimal value of Mq can be found by recognizing that the number of bits necessary for (6) to hold for asymptotically large n (i.e., for the quantized difference in (8) to exceed one LSB) is\n\nMq min = 1 + ceil(-log2[Delta_r1]) + 1 (sign bit + magnitude bits) if -log2[P(w_rj)] is in Z+ for j in {1, 2}; Mq min = 1 + ceil(-log2[Delta_r1]) (sign bit + magnitude bits) otherwise, (15)\n\nwhere Z+ represents the set of all positive integers. Note that the conditional nature of Mq min in (15) prevents the case in which lim_{eps -> 0} P(w_r1) - eps = L_Mq[P(w_r1)] or P(w_r2) = U_Mq[P(w_r2)]; either case would require an infinitely large sample size before the variance of the corresponding estimated probability became small enough to distinguish q_M[P-hat(w_r1)] from q_M[P-hat(w_r2)]. The sign bit in (15) is not required to estimate the probabilities themselves in (8), but it is necessary to compute the difference between the two probabilities in that equation - this difference being the ultimate computation by which we choose the most likely die face.
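The quantizer (2), the two estimates (7)-(8), and the bit count (15) are easy to evaluate numerically. The sketch below is ours, not the authors'; zero is treated as non-positive, as the text specifies, and `mq_min` adds the extra magnitude bit only in the power-of-two boundary case of (15):

```python
import math

def q(x, m_q):
    """M_q-bit fixed-point approximation of x in (-1, 1], per (2).
    Zero maps into the negative half-interval (a non-positive value)."""
    s = 1.0 if x > 0 else -1.0
    a = abs(x)
    if a == 1.0:
        return s * (1.0 - 2.0 ** -m_q)
    return s * (math.floor(a * 2 ** (m_q - 1)) * 2.0 ** -(m_q - 1) + 2.0 ** -m_q)

def mq_min(delta_r1, p_r1, p_r2):
    """Minimum bits (sign + magnitude) needed by the probabilistic
    strategy to separate the two quantized estimates, per (15)."""
    bits = 1 + math.ceil(-math.log2(delta_r1))
    if any((-math.log2(p)).is_integer() for p in (p_r1, p_r2)):
        bits += 1  # boundary case: one extra magnitude bit
    return bits

# A sample with k_r1 = 37, k_r2 = 28 out of n = 100 tosses:
n, k1, k2 = 100, 37, 28
print(q((k1 - k2) / n, 1))          # eq. (7): the sign survives at M_q = 1 -> 0.5
print(q(k1 / n, 1) - q(k2 / n, 1))  # eq. (8): collapses to 0.0 at M_q = 1
print(mq_min(0.09, 0.37, 0.28))     # -> 5
```

On the die of section 2.1 (P(w_r1) = 0.37, P(w_r2) = 0.28, so Delta_r1 = 0.09), the 1-bit differential estimate already carries the correct sign, the 1-bit probabilistic estimate collapses to zero, and (15) demands 5 bits - matching the Mq = 5 reported there.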
1.3 The Sample Complexity Product\n\nWe introduce the sample complexity product (SCP) as a measure of both the number of samples and the functional complexity (measured in bits) required to identify the most likely face of an unfair die with specified probability:\n\nSCP := n * Mq s.t. P(most likely face correctly ID'd) >= alpha. (16)\n\n2 A Comparison of the Sample Complexity Requirements for the Probabilistic and Differential Strategies\n\nAxiom 1 We view the number of bits Mq in the finite-precision approximation q_M[x] to the real number x in (-1, 1] as a measure of the approximation's functional complexity. That is, the functional complexity of an approximation is the number of bits with which it represents a real number on (-1, 1].\n\nAssumption 1 If P(w_r1) > P(w_r2), then P-hat(w_r1) will be greater than P-hat(w_rj) for all j > 2 (see [1] for an analysis of cases in which this assumption is invalid).\n\nQuestion: What is the probability that the most likely face of an unfair die will be empirically identifiable after n tosses?\n\nAnswer for the probabilistic strategy:\n\nP( q_M[P-hat(w_r1)] > q_M[P-hat(w_rj)] for all j > 1 ) ~= sum_{k_r1 = lambda_1}^{nu_1} sum_{k_r2 = lambda_2}^{nu_2} n! / (k_r1! k_r2! (n - k_r1 - k_r2)!) * P(w_r1)^{k_r1} P(w_r2)^{k_r2} (1 - P(w_r1) - P(w_r2))^{n - k_r1 - k_r2}, (17)\n\nwhere\n\nlambda_1 = max( B + 1, ceil(n/C) + 1 ), nu_1 = n, lambda_2 = 0, nu_2 = min( B, n - k_r1 ), with B an element of the boundary set {B_Mq}: B = k_{U_Mq}[P(w_r2)] = k_{L_Mq}[P(w_r1)] - 1, for all C > 2. (18)\n\nThere is a simple recursion in [1] by which every possible boundary for Mq-bit quantization leads to itself and two additional boundaries in the set {B_(Mq+1)} for (Mq + 1)-bit quantization.
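For small n and C = 3, the left-hand probability in (17) can also be computed exactly by brute-force enumeration of all multinomial outcomes, with no need for the bounds in (18). This is our illustrative sketch (the quantizer here keeps only the magnitude bits, since probabilities are non-negative):

```python
from itertools import product
from math import factorial, floor

def q(x, bits):
    """Magnitude part of the M_q-bit quantizer of section 1.1."""
    return floor(x * 2 ** (bits - 1)) * 2.0 ** -(bits - 1) + 2.0 ** -bits

def exact_success(probs, n, bits):
    """Exact P(q[phat_0] > q[phat_j] for all j > 0) for a 3-faced die,
    summing the multinomial pmf over every outcome of n tosses."""
    p0, p1, p2 = probs
    total = 0.0
    for k0, k1 in product(range(n + 1), repeat=2):
        k2 = n - k0 - k1
        if k2 < 0:
            continue
        if q(k0 / n, bits) > q(k1 / n, bits) and q(k0 / n, bits) > q(k2 / n, bits):
            coef = factorial(n) // (factorial(k0) * factorial(k1) * factorial(k2))
            total += coef * p0**k0 * p1**k1 * p2**k2
    return total

die3 = [0.5, 0.3, 0.2]
print(exact_success(die3, 30, 1), exact_success(die3, 30, 5))
```

With Mq = 1 every estimated probability quantizes to the same value, so the probabilistic strategy's success probability is essentially zero; at Mq = 5 the faces become distinguishable - the behavior that the comparison with (19)-(20) formalizes.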
Answer for the differential strategy:\n\nP( L_Mq[Delta_r1] < Delta-hat_r1 < U_Mq[Delta_r1] ) ~= sum_{k_r1 = lambda_1}^{nu_1} sum_{k_r2 = lambda_2}^{nu_2} n! / (k_r1! k_r2! (n - k_r1 - k_r2)!) * P(w_r1)^{k_r1} P(w_r2)^{k_r2} (1 - P(w_r1) - P(w_r2))^{n - k_r1 - k_r2}, (19)\n\nwhere\n\nlambda_1 = max( k_{L_Mq}[Delta_r1], ceil(n/(C - 1)) + 1 ), nu_1 = n, lambda_2 = max( 0, k_r1 - k_{U_Mq}[Delta_r1] ), nu_2 = min( k_r1 - k_{L_Mq}[Delta_r1], n - k_r1 ), for all C > 2. (20)\n\nSince the multinomial distribution is positive semi-definite, it should be clear from a comparison of (17)-(18) and (19)-(20) that P( L_Mq[Delta_r1] < Delta-hat_r1 < U_Mq[Delta_r1] ) is largest (and larger than any possible P( q_M[P-hat(w_r1)] > q_M[P-hat(w_rj)] for all j > 1 )) for a given sample size n when the differential strategy is employed with Mq = 1, such that L_M1[Delta_r1] = 0 and U_M1[Delta_r1] = 1 (i.e., k_{L_M1}[Delta_r1] = 1 and k_{U_M1}[Delta_r1] = n). The converse is also true, to wit:\n\nTheorem 1 For a fixed value of n in (19), the 1-bit approximation to Delta_r1 yields the highest probability of identifying the most likely die face w_r1.\n\nIt can be shown that theorem 1 does not depend on the validity of assumption 1 [1]. Given axiom 1, the following corollary to theorem 1 holds:\n\nCorollary 1 The differential strategy's minimum-complexity 1-bit approximation of Delta_r1 yields the highest probability of identifying the most likely die face w_r1 for a given number of tosses n.\n\nCorollary 2 The differential strategy's minimum-complexity 1-bit approximation of Delta_r1 requires the smallest sample size (n_min) necessary to identify P(w_r1) - and thereby the most likely die face w_r1 - correctly with specified confidence. Thus, the differential strategy requires the minimum SCP necessary to identify the most likely die face with specified confidence.\n\n2.1 Theoretical Predictions versus Empirical Results\n\nFigures 1 and 2 compare theoretical predictions of the number of samples n and the number of bits Mq necessary to identify the most likely face of a particular die versus the actual requirements obtained from 1000 games (3000 tosses of the die in each game).
The die has five faces with probabilities P(w_r1) = 0.37, P(w_r2) = 0.28, P(w_r3) = 0.2, P(w_r4) = 0.1, and P(w_r5) = 0.05. The theoretical predictions for Mq and n (arrows with boxed labels, based on iterative searches employing equations (17) and (19)) that would with 0.95 confidence correctly identify the most likely die face w_r1 are shown to correspond with the empirical results: in figure 1 the empirical 0.95 confidence interval is marked by the lower bound of the dark gray and the upper bound of the light gray; in figure 2 the empirical 0.95 confidence interval is marked by the lower bound of the P-hat(w_r1) distribution and the upper bound of the P-hat(w_r2) distribution.\n\n[Figure 1: Theoretical predictions of the number of tosses needed to identify the most likely face w_r1 with 95% confidence (Die 1): differential strategy prediction superimposed on empirical results of 1000 games (3000 tosses each).]\n\n[Figure 2: Theoretical predictions of the number of tosses needed to identify the most likely face w_r1 with 95% confidence (Die 1): probabilistic strategy prediction superimposed on empirical results of 1000 games (3000 tosses each).]\n\nThese figures illustrate that the differential strategy's minimum SCP is 227 (n = 227, Mq = 1), while the minimum SCP for the probabilistic strategy is 2720 (n = 544, Mq = 5). A complete tabulation of SCP as a function of P(w_r1), P(w_r2), and the worst-case choice for C (the number of classes/die faces) is given in [1].\n\n3 Conclusion\n\nThe sample complexity product (SCP) notion of functional complexity set forth herein is closely aligned with the complexity measures of Kolmogorov and Rissanen [4, 6].
We have used it to prove that the differential strategy for learning the Bayesian discriminant function is optimal in terms of its minimum requirements for classifier functional complexity and number of training examples when the classification task is identifying the most likely face of an unfair die. It is relatively straightforward to extend theorem 1 and its corollaries to the general pattern recognition case in order to show that the expected SCP for the 1-bit differential strategy\n\nE[SCP]_differential ~= integral_X n_min[P(w_r1 | x), P(w_r2 | x)] * Mq min[P(w_r1 | x), P(w_r2 | x)] p(x) dx, with Mq min = 1 for the 1-bit differential strategy, (21)\n\n(or the discrete random vector analog of this equation) is minimal [1]. This is because n_min is by corollary 2 the smallest sample size necessary to distinguish any and all P(w_r1) from lesser P(w_r2). The resulting analysis confirms that the classifier trained with the differential strategy for statistical pattern recognition (i.e., using a CFMmono objective function) has the highest probability of learning the Bayesian discriminant function when the functional capacity of the classifier and the available training data are both limited.\n\nThe relevance of this work to the process of designing and training robust connectionist pattern classifiers is evident if one considers the practical meaning of the terms n_min[P(w_r1 | x), P(w_r2 | x)] and Mq min[P(w_r1 | x), P(w_r2 | x)] in the sample complexity product of (21). Given one's choice of connectionist model to employ as a classifier, the Mq min term dictates the minimum necessary connectivity of that model. For example, (21) can be used to prove that a partially connected radial basis function (RBF) network with trainable variance parameters and three hidden-layer 'nodes' has the minimum Mq necessary for Bayesian discrimination in the 3-class task described by [5].
However, because both SCP terms are functions of the probabilistic nature of the random feature vector being classified and the learning strategy employed, that minimal RBF architecture will only yield Bayesian discrimination if trained using the differential strategy. The probabilistic strategy requires significantly more functional complexity in the RBF in order to meet the requirements of the probabilistic strategy's SCP [1]. Philosophical arguments regarding the use of the differential strategy in lieu of the more traditional probabilistic strategy are discussed at length in [1].\n\nAcknowledgement\n\nThis research was funded by the Air Force Office of Scientific Research under grant AFOSR-89-0551. We gratefully acknowledge their support.\n\nReferences\n\n[1] J. B. Hampshire II. A Differential Theory of Statistical Pattern Recognition. PhD thesis, Carnegie Mellon University, Department of Electrical & Computer Engineering, Hammerschlag Hall, Pittsburgh, PA 15213-3890, 1992. Manuscript in progress.\n\n[2] J. B. Hampshire II and B. A. Pearlmutter. Equivalence Proofs for Multi-Layer Perceptron Classifiers and the Bayesian Discriminant Function. In Touretzky, Elman, Sejnowski, and Hinton, editors, Proceedings of the 1990 Connectionist Models Summer School, pages 159-172, San Mateo, CA, 1991. Morgan Kaufmann.\n\n[3] J. B. Hampshire II and A. H. Waibel. A Novel Objective Function for Improved Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on Neural Networks, 1(2):216-228, June 1990. A revised and extended version of work first presented at the 1989 International Joint Conference on Neural Networks, vol. I, pp. 235-241.\n\n[4] A. N. Kolmogorov. Three Approaches to the Quantitative Definition of Information. Problems of Information Transmission, 1(1):1-7, Jan.-Mar. 1965. Faraday Press translation of Problemy Peredachi Informatsii.\n\n[5] M. D. Richard and R. P. Lippmann.
Neural Network Classifiers Estimate Bayesian a posteriori Probabilities. Neural Computation, 3(4):461-483, 1991.\n\n[6] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.", "award": [], "sourceid": 456, "authors": [{"given_name": "J. B.", "family_name": "Hampshire II", "institution": null}, {"given_name": "B.", "family_name": "Kumar", "institution": null}]}