Part of Advances in Neural Information Processing Systems 4 (NIPS 1991)
J. B. Hampshire II, B. Kumar
We compare two strategies for training connectionist (as well as non-connectionist) models for statistical pattern recognition. The probabilistic strategy is based on the notion that Bayesian discrimination (i.e., optimal classification) is achieved when the classifier learns the a posteriori class distributions of the random feature vector. The differential strategy is based on the notion that the identity of the largest class a posteriori probability of the feature vector is all that is needed to achieve Bayesian discrimination. Each strategy is directly linked to a family of objective functions that can be used in the supervised training procedure. We prove that the probabilistic strategy - linked with the error measure objective functions, such as mean-squared-error and cross-entropy, typically used to train classifiers - necessarily requires larger training sets and more complex classifier architectures than those needed to approximate the Bayesian discriminant function. In contrast, we prove that the differential strategy - linked with classification figure-of-merit objective functions (CFMmono) [3] - requires the minimum classifier functional complexity and the fewest training examples necessary to approximate the Bayesian discriminant function with specified precision (measured in probability of error). We present our proofs in the context of a game of chance in which an unfair C-sided die is tossed repeatedly. We show that this rigged game of dice is a paradigm at the root of all statistical pattern recognition tasks, and demonstrate how a simple extension of the concept leads us to a general information-theoretic model of sample complexity for statistical pattern recognition.
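The rigged-die paradigm can be made concrete with a short simulation. The Python sketch below is our own illustration, not code from the paper; the class priors and the class-conditional face probabilities are hypothetical. It draws tosses of an unfair C-sided die whose loading depends on the true class, and classifies each outcome by selecting the class with the largest a posteriori probability - the identity of that class being all the differential strategy requires.

```python
# Sketch of the "rigged die" view of statistical pattern recognition.
# Assumed (hypothetical) quantities: the priors and the face probabilities.
import numpy as np

rng = np.random.default_rng(0)
C = 3                                   # number of classes (die loadings)
# Hypothetical class-conditional face probabilities p(face | class), one row per class
likelihoods = np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.5, 0.3],
                        [0.1, 0.2, 0.7]])
priors = np.array([0.5, 0.3, 0.2])      # hypothetical P(class)

def posterior(face):
    """A posteriori class probabilities P(class | face) via Bayes' rule."""
    joint = likelihoods[:, face] * priors
    return joint / joint.sum()

def bayes_decision(face):
    """Bayesian discrimination needs only the identity of the largest posterior."""
    return np.argmax(posterior(face))

# Toss the rigged die repeatedly and classify each outcome.
true_classes = rng.choice(C, size=10_000, p=priors)
faces = np.array([rng.choice(C, p=likelihoods[c]) for c in true_classes])
decisions = np.array([bayes_decision(f) for f in faces])
print("empirical probability of error:", np.mean(decisions != true_classes))
```

In this sketch the probabilistic strategy would correspond to regressing the full posterior vector returned by `posterior`, whereas the differential strategy only has to reproduce the argmax computed in `bayes_decision`, which is a strictly weaker requirement.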